As I wanted a near real-time dashboard of some of our core databases, I started looking at various solutions to achieve this. It also didn’t hurt that I had worked on a proof of concept using CDC with C# and StreamInsight in the past. It gave me some inspiration for this new process, and I also took the opportunity to rewrite and improve some of the parts when writing the new PowerShell code.
Here are some of the pieces I ended up selecting:
Dashboard: Microsoft Power BI
Data Extraction: PowerShell
Message Queue: Azure Event Hub
Real-time aggregation engine: Azure Stream Analytics
Database Change Tracking: SQL Server Change Data Capture
- Configure which tables will be tracked by the DataCollector PowerShell process
- Configure which columns on those tables will be collected
- Enumerate the databases that will be collected
- Enumerate the tables collected
- Enumerate the columns collected
- Capture the CDC changes starting after the last LSN that was captured and output that change as a custom PSObject
- The PSObject is then converted to a hashtable to facilitate the transmission to Azure Event Hub
- The hashtable is then converted to JSON and included as the body of an Azure Event Hub Message.
- The message is then queued in Azure Event Hub
- Azure Stream Analytics will then pull the messages from the Azure Event Hub queue
- Aggregation/filtering will be performed as per the query defined
- The result of the query is then sent as a record in a Power BI table in a dataset.
- The Power BI report is based on the dataset, which, when included in a Power BI dashboard, will refresh as new data comes in.
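The middle steps of that flow can be sketched with a small, illustrative example. The server, database, and column values here are placeholders, not part of the module:

```powershell
# Illustrative only: the shape of one CDC change as it moves through the pipeline.
# Server, database, and column values are placeholders.
$change = [ordered]@{
    ServerName                = 'SQL01'
    DatabaseName              = 'Sales'
    SchemaName                = 'dbo'
    TableName                 = 'Orders'
    _CDC_OperationDescription = 'Insert'
}

# The hashtable is serialized to JSON to become the Event Hub message body.
$body = $change | ConvertTo-Json -Compress
```

Keeping the body this small is what makes the per-message Event Hub cost negligible.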
Here’s how the collection process would be invoked:
Get-DataCollectorCDCChange | Convert-DataCollectorCDCChangeToHashTable | Send-SBEventHubMessage -url "<path to the event hub queue>" -sasKey "<the event hub SAS key (authentication)>"
The overall process has a fairly low impact on your source SQL Server. In our case, we’re talking below 25 reads per collection and 15ms of CPU time.
Here’s a diagram of what the process looks like at a high level:
Here’s more information on the various pieces involved in the process.
Set-DataCollectorCDCConfiguration: Creates the required tables to track the DataCollector operations. It creates the CDCTrackedTables table (which tracks each table by server, database, and schema) and the CDCTrackedTableColumns table (which tracks which columns are captured per tracked table).
Get-SQLCDCTable: Lists the tables on a server/database that have CDC enabled.
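Get-SQLCDCTable only lists tables where CDC is already enabled; enabling it in the first place is done with the standard SQL Server stored procedures. A quick sketch, using a hypothetical Sales database and dbo.Orders table:

```sql
-- Enable CDC at the database level (requires sysadmin).
USE Sales;
EXEC sys.sp_cdc_enable_db;

-- Enable CDC for an individual table; dbo.Orders is a placeholder.
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Orders',
    @role_name     = NULL;
```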
New-DataCollectorCDCTrackedTable: Creates a record in the CDCTrackedTables table in the DataCollector configuration database. This determines that this table will be collected by the process. It also keeps track of the last LSN that was captured in the process. By default, when the configuration is created for the table, it will take the first LSN that’s available for that object.
New-DataCollectorCDCTrackedTableColumn: This will create a configuration record in the CDCTrackedTableColumns table that contains the name of the column that will be captured by the process. By default, no columns are configured.
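Putting the configuration cmdlets together, a one-time setup could look like the following. The parameter names are my best guess at the module's interface, and the server/table/column values are placeholders:

```powershell
# One-time setup (parameter names and values are assumptions).
Set-DataCollectorCDCConfiguration

# Register a table for collection; the last captured LSN defaults to
# the first LSN available for that object.
New-DataCollectorCDCTrackedTable -ServerName 'SQL01' -DatabaseName 'Sales' `
    -SchemaName 'dbo' -TableName 'Orders'

# Register the columns to capture; by default, no columns are configured.
New-DataCollectorCDCTrackedTableColumn -ServerName 'SQL01' -DatabaseName 'Sales' `
    -SchemaName 'dbo' -TableName 'Orders' -ColumnName 'OrderStatus'
```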
Get-DataCollectorCDCChange: Overarching process for the data collection; contains the loop that will call Get-DataCollectorCDCTrackedDatabase at a specific interval.
Get-DataCollectorCDCTrackedDatabase: Gets the list of databases for which the CDC changes will be collected.
Get-DataCollectorCDCTrackedTable: Gets the list of tables in a database for which the CDC changes will be collected.
Get-DataCollectorCDCTrackedTableColumn: Gets the list of columns that will be extracted from the CDC change record. This allows you to select only a few columns to send further down the process, which helps keep down the size of the event messages sent to Azure Event Hub.
Get-DataCollectorCDCDatabaseChange: Gets the CDC change records from the SQL Server instance/database/table selected. A custom PSObject is sent to the output stream that contains the columns selected by Get-DataCollectorCDCTrackedTableColumn.
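Under the hood, this kind of collection typically relies on the built-in CDC table-valued functions. A sketch of the underlying T-SQL, assuming a hypothetical dbo_Orders capture instance and OrderStatus column:

```sql
DECLARE @from_lsn binary(10), @to_lsn binary(10);

-- Resume just after the last LSN stored in CDCTrackedTables for this table
-- (@last_captured_lsn is supplied by the collection process).
SET @from_lsn = sys.fn_cdc_increment_lsn(@last_captured_lsn);
SET @to_lsn   = sys.fn_cdc_get_max_lsn();

-- Return every change in that LSN range, one row per operation.
SELECT __$start_lsn, __$operation, OrderStatus
FROM cdc.fn_cdc_get_all_changes_dbo_Orders(@from_lsn, @to_lsn, N'all');
```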
Set-DataCollectorCDCTrackedTableLastLSN: Updates the CDCTrackedTables record in the DataCollector database with the last LSN captured by the data collection process.
Convert-DataCollectorCDCChangeToHashTable: Converts the PSObject to a hashtable to facilitate sending the Azure Event Hub message.
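A minimal sketch of what such a conversion could look like (assuming property names carry through unchanged; the function name here is illustrative, not the module's):

```powershell
function Convert-ToHashTable {
    # Sketch only: copy each property of an incoming PSObject into a hashtable.
    param([Parameter(ValueFromPipeline)] [psobject] $InputObject)
    process {
        $result = @{}
        foreach ($property in $InputObject.PSObject.Properties) {
            $result[$property.Name] = $property.Value
        }
        $result
    }
}
```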
Send-SBEventHubMessage: Sends the CDC change message to Azure Event Hub using JSON as message body.
Azure Event Hub
An Azure Event Hub queue is required to store all the CDC change messages. Those will then be consumed by the Azure Stream Analytics job.
Azure Stream Analytics
An Azure Stream Analytics job is required to process the CDC change messages to perform the filtering/aggregation. You first configure the job input to point to the previously created Event Hub Queue. You will then create a new output that will point to a Power BI account.
You can then define the query based on the defined input and output. Here’s an example of the query I’ve used to generate the dashboard:
SELECT ServerName, DatabaseName, SchemaName, TableName, _CDC_OperationDescription, COUNT(*)
INTO [power-bi-output]
FROM [cdc-eventhub]
GROUP BY ServerName, DatabaseName, SchemaName, TableName, _CDC_OperationDescription, TUMBLINGWINDOW(mi, 2)
In that particular example, a simple COUNT aggregation is performed on all the messages received per server, database, schema, table, and CDC operation. That aggregation operates on a tumbling window of 2 minutes. Once the window ends, the result of the aggregation is sent to the output, which in our case is a Power BI table in a dataset.
Once that’s defined, you can then start the Azure Stream Analytics job. Starting it takes a few minutes; once the job is running, it will begin outputting data to Power BI. Note that the Power BI dataset/table will not be created until the job outputs data.
The Power BI part is pretty straightforward. You first define a report that connects to your Azure Stream Analytics dataset/table. You then pin that report to an existing or a new Power BI dashboard. Once you are in the dashboard view and the Azure Stream Analytics job outputs data, you will then see the graph update as new data is received.
Here’s what the end result looks like:
Here are some of the future improvements I would like to add:
- Add a way to enrich the captured data (e.g. foreign key descriptions), either:
  - By adding reference tables in Azure SQL, or
  - By enriching the data as it goes through the PowerShell pipeline
- Add infrastructure to orchestrate the collection process across multiple servers; right now the process assumes that all the data collection is done from a single server.
The first release of the code is available on Codeplex with version 126.96.36.199 and forward. I hope this might be valuable to some of you! Should you have any questions regarding this, feel free to let me know through the comments below!
For more information