This all-day pre-conference session was a great way to kick off Ignite! The session was delivered by Rafal Lukawiecki, a great gentleman with a lot of enthusiasm and passion for his subject. He was able to capture our attention from beginning to end on a subject as difficult as this one. Check out his company's web site: he provided us with free access to his company's content for a month, which is a great way to learn more about the topic. One thing that really interests me is the upcoming 4-day training in Chicago on the subject. While we learned quite a bit today, having more in-depth knowledge of the subject would definitely be nice!
It was a good review for me of some tools I had forgotten about (Data Mining Add-In for Excel, SQL Server Data Tools) and a good opportunity to better understand some new tools:
– The R language with RStudio and Rattle (a tool that helps you quickly build R code)
– Azure ML Studio
One thing that was interesting throughout the session was seeing how much SSAS data mining is still a very relevant and highly productive tool for machine learning. It offers great visualizations in SQL Server Data Tools that are not yet offered in the Azure ML product, with the exception of box plots and variable histograms (thanks Rafal for pointing that out!). According to the speaker, more visualizations should be coming fairly soon in Azure ML.
However, the ability to take a data mining model from experimentation to production through a REST API based service in Azure ML is a really nice advantage. I could easily see myself building a predictive model and integrating it using PowerShell for some infrastructure problems, or in C# for some of our line-of-business applications. While you could do something similar with DMX predictive queries, you might need to wrap them in a web service to improve/extend the integration capabilities.
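As a rough sketch of what that integration could look like, here is a minimal Python example that builds the JSON request body a classic Azure ML scoring web service expects. The endpoint URL, API key, and column names are all placeholders; the exact schema depends on the experiment you publish.

```python
import json

# Hypothetical endpoint and key -- replace with the values shown in your
# own Azure ML web service dashboard.
ENDPOINT = "https://example.azureml.net/workspaces/123/services/456/execute"
API_KEY = "your-api-key"

def build_scoring_request(column_names, rows):
    """Build the JSON body used by a classic Azure ML scoring service."""
    return json.dumps({
        "Inputs": {
            "input1": {
                "ColumnNames": column_names,
                "Values": rows,
            }
        },
        "GlobalParameters": {},
    })

# One case to score, with made-up variable names.
body = build_scoring_request(["Age", "Income", "CarsOwned"], [[42, 55000, 1]])
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer " + API_KEY,
}
# The actual call would then be a simple HTTP POST, e.g. with
# urllib.request.Request(ENDPOINT, body.encode(), headers).
print(body)
```

The same payload could just as easily be assembled in PowerShell or C#, which is what makes the REST approach attractive for integration.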
More importantly, I learned a more formal way of approaching a data mining/machine learning exercise, including a way to compare the performance of mining algorithms. I also learned about the composition of the team required to answer predictive analytics questions:
– A domain expert (usually someone from the business with intimate knowledge of the problem area -> SME)
– A data expert (someone who knows the company’s data, its location and intricacies)
– A data scientist (someone with expert knowledge of the various tools, algorithms and techniques to explore data)
About 80% of the data scientist's time will be spent on data preparation before the data is really usable by a data mining algorithm. You're basically looking to denormalize the data as much as possible to give each case (or record) a maximum number of variables for the algorithm to look at. The speaker mentioned that you would usually have between 20 and 200 case variables, and this number can even go into the thousands. It was mentioned that some tools are now aiming to support 10,000 variables. When you're using feature extraction on text data, that can quickly increase the number of variables used in your data mining model.
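To make the idea concrete, here is a toy Python sketch of that denormalization step (all the table and column names are made up): related child records are flattened into derived variables on a single wide case row, so the algorithm sees one record per customer with as many variables as possible.

```python
# Made-up source tables: one customer record and its related order rows.
customer = {"id": 1, "age": 42, "city": "Chicago"}
orders = [
    {"customer_id": 1, "amount": 250.0},
    {"customer_id": 1, "amount": 120.0},
]

def build_case(customer, orders):
    """Flatten a customer plus its child records into one wide case row."""
    case = dict(customer)  # start from the customer's own attributes
    amounts = [o["amount"] for o in orders
               if o["customer_id"] == customer["id"]]
    # Derived variables aggregated from the child table:
    case["order_count"] = len(amounts)
    case["total_spent"] = sum(amounts)
    case["avg_order"] = sum(amounts) / len(amounts) if amounts else 0.0
    return case

print(build_case(customer, orders))
```

In practice this is the kind of work you would do in T-SQL or an ETL tool, but the shape of the result is the same: one row per case, many variables.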
One thing I learned today is that you can use a box plot to review your data while you're preparing it. This visualization technique shows a lot of useful information about your data in a concise fashion. You can quickly spot:
– The median and overall spread (interquartile range) of a variable
– Skewness in the distribution
– Outliers that fall beyond the whiskers
While this is a great data exploration technique, you will have to do this in R instead of SSAS/Excel/Azure ML.
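The numbers a box plot draws are easy to compute yourself; here is a small Python sketch of the five-number summary plus the usual 1.5 × IQR whisker fences (a common convention, and the one Tukey-style box plots use), which is enough to flag outliers without plotting anything:

```python
import statistics

def box_plot_stats(values):
    """Five-number summary plus Tukey outlier fences, as drawn by a box plot."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4)  # quartiles (Python 3.8+)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr           # whisker fences
    outliers = [x for x in data if x < lo or x > hi]
    return {"min": data[0], "q1": q1, "median": median,
            "q3": q3, "max": data[-1], "outliers": outliers}

print(box_plot_stats([3, 4, 5, 5, 6, 7, 8, 42]))  # 42 is flagged as an outlier
```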
Throughout the day we used a sample dataset that represented customers with a particular set of attributes. The overall goal was to use data mining algorithms to predict which customers would be the most likely to buy a certain number of cars. We learned how to evaluate the quality of a data mining model by looking at certain metrics, such as precision and recall/sensitivity. The overall goal is to find the right balance between the precision of the model and its sensitivity. You basically don't want a model that generates too many false positives (too sensitive), as that might have a negative impact on your business, either through too much effort spent investigating the false alarms or by potentially upsetting your customers (e.g. in a fraud detection scenario). You also want a model that properly identifies the real cases you're looking for.
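Both metrics fall straight out of the confusion matrix, and a few lines of Python make the trade-off concrete (the counts below are made up for illustration):

```python
def precision_recall(tp, fp, fn):
    """Precision: how many flagged cases are real.
    Recall/sensitivity: how many real cases were caught."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical confusion counts: 80 hits, 20 false alarms, 10 misses.
p, r = precision_recall(tp=80, fp=20, fn=10)
print(p, r)  # 0.8 precision, ~0.89 recall
```

Pushing the model to catch the 10 misses (higher recall) will usually cost you more false alarms (lower precision), which is exactly the balance described above.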
Looking at lift and ROC charts helps you figure out how effective your model is at making predictions. The goal is to have a model that generates high-probability predictions while looking at a small amount of data.
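A lift chart boils down to one ratio: how much richer in positives the model's top-scored slice is compared with the overall base rate. A small sketch with made-up scored cases:

```python
def lift_at(scores_and_labels, fraction):
    """Lift: positive rate in the model's top slice vs. the base rate."""
    ranked = sorted(scores_and_labels, key=lambda sl: sl[0], reverse=True)
    k = max(1, int(len(ranked) * fraction))
    top_rate = sum(label for _, label in ranked[:k]) / k
    base_rate = sum(label for _, label in ranked) / len(ranked)
    return top_rate / base_rate

# Hypothetical scored cases: (predicted probability, actual buyer? 1/0).
cases = [(0.95, 1), (0.90, 1), (0.80, 0), (0.70, 1),
         (0.40, 0), (0.30, 0), (0.20, 0), (0.10, 0)]
print(lift_at(cases, 0.25))  # top 25% of cases vs. picking at random
```

A lift well above 1.0 in the top slice is what "high-probability predictions from a small amount of data" looks like numerically.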
Another way to validate your model is A/B testing, where you influence part of the dataset and compare whether your predictive model ultimately changed anything. You can also simply validate the predictions with business people to see if the model makes sense.
I also learned a couple of techniques to help improve model accuracy. One of them is called a parameter sweep, where you basically tune the data mining algorithm by feeding it different parameters and then comparing the core metrics that evaluate prediction quality. You can also ask the automated tuning process to tune the model to favor either sensitivity or precision, depending on what you're shooting for.
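The sweep itself is just an exhaustive loop over candidate settings, keeping whichever combination scores best on your chosen metric. A self-contained Python sketch (the parameter names and the toy scoring function are made up so the example runs on its own):

```python
from itertools import product

def sweep(train_fn, score_fn, grid):
    """Try every parameter combination; keep the best (score, params) pair."""
    best = None
    for combo in product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        model = train_fn(**params)      # train with this combination
        score = score_fn(model)         # evaluate on the chosen metric
        if best is None or score > best[0]:
            best = (score, params)
    return best

# Toy stand-ins so the sketch is runnable: "training" just records the
# parameters, and the "score" peaks at depth=4, min_cases=10.
train_fn = lambda **p: p
score_fn = lambda m: -abs(m["depth"] - 4) - abs(m["min_cases"] - 10) / 10
print(sweep(train_fn, score_fn, {"depth": [2, 4, 8], "min_cases": [5, 10, 20]}))
```

Swapping the scoring function is how you steer the tuning toward sensitivity or precision.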
More intuitively, doing things like feature engineering and feature selection greatly influences the predictive abilities of the model. However, you need to be careful while selecting the variables used.
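One simple (if crude) selection heuristic is to score each candidate variable by its correlation with the target and drop the ones that carry no signal. A toy Python sketch with made-up variables:

```python
def correlation(xs, ys):
    """Pearson correlation, used here as a crude feature-relevance score."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical case variables and a 0/1 target: keep only the features
# whose absolute correlation with the target clears a threshold.
features = {
    "age":       [25, 45, 35, 50, 23, 40],
    "shoe_size": [9, 10, 8, 9, 10, 8],
}
target = [0, 1, 0, 1, 0, 1]
selected = [name for name, col in features.items()
            if abs(correlation(col, target)) > 0.5]
print(selected)
```

Real feature selection needs more care than this (correlated features, non-linear effects, leakage), which is exactly why the caution above matters.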
I’m now more prepared to roll up my sleeves and start data mining again with those newly acquired techniques!