Here’s an interesting research paper I found that seems to support my previous blog post:

http://gamma.cs.unc.edu/DB/main.pdf

GPU Accelerated Databases

February 27, 2008

I was reading about general-purpose GPU programming on http://www.gpgpu.org and started to think about how GPUs could be leveraged in database technology. One use that came to mind immediately was geospatial and geometric data. I’m far from being an expert on the matter, but I would think that one could offload most of the calculations to a GPU. Both raster and vector maps can benefit from a GPU, since it handles both bitmaps and vector data like a real champion.

Another use that I think might work is indexing data. If you represent data using geometric patterns, it would be conceivable to use a GPU to perform pattern matching very efficiently, thanks to the highly parallel nature of those processing units. If you combine those patterns with set theory, you could define patterns that encompass the actual data. By combining geometric patterns, those indices would be able to determine which data should be included in queries involving a wide range of aggregations and computations.

I’d be curious to see whether, with the CLR integration in SQL Server, one could call DirectX libraries to offload some work to a GPU.

Feel free to comment on the post; I’m learning and thinking out loud here! I’ll probably be posting more thoughts about this in the coming posts. I’m already thinking about applications for data mining and OLAP data…

We see application virtualization software emerging more and more in the enterprise with products like Microsoft Application Virtualization and Citrix Application Streaming. I think if there’s a scenario where application virtualization would be well received, it would be home use of software. This would have a few advantages for users.

First of all, the stability of their PCs would most likely increase, as each application operates in a sandboxed environment, which can avoid a lot of headaches. There would be no more complicated setup and configuration for end users either: just click on the icon and there it is. Installing an application seems like no big deal to IT pros, but for users at home it’s a risky operation. Just deploy a stable base operating system image and the user is set.

Another upside to this model is related to SaaS: users could pay only for their actual usage of the software. For instance, when I’m at home, I’m not using Microsoft Office 100% of the time; I might only need it a few hours per week. I have a hard time paying a few hundred dollars for my copy of Office given how little I use it at home. I’d much rather pay $100 for a bank of 10 hours of use per month.

Hopefully software vendors will get their act together as they did with full OS virtualization. As this article points out, we have a bit of a way to go: http://www.news.com/Microsoft-Streaming-Office-infringes-license/2100-1012_3-6229776.html

When you have something like application virtualization, or a published desktop or application via Citrix or Terminal Services, why would you want to build web versions of applications, as Google is trying to do with Google Docs?

What would happen if you were to combine Boinc and Facebook? Probably the most powerful data mining application ever created.

The pure computing power available through the workload distribution of Boinc and the wealth of information present on Facebook would produce one scary application. The kind of application Big Brother would like to have… Maybe he already has! One of the most time-consuming tasks in gathering intelligence is identifying people and their relations with each other. That is one of the main features of Facebook. If you combine this with the fact that people add to this knowledge by tagging people in pictures, and with the rest of the information that is available, things get very interesting.

Boinc offers 1.06 PetaFLOP/sec, which is twice what BlueGene/L, the fastest supercomputer on the planet, delivers. The more users get involved with Boinc, the more its capacity increases. Why would the government pay to build such a large and resilient supercomputer itself? I would give tax credits to people who contribute to the national computing resource, because it has value. I’m surprised there isn’t a market for that yet! Just as a nation has oil and gold reserves, computing resources are now a way to rank a nation’s power.

If I were a hacker, the first thing I would hack is Boinc! I would use it to break Facebook and then use its crunching power to process the data to find Usama Bin Laden and get the 5 million dollars! :-p

Here’s another feature that I would like to see in SQL Server. When it comes to managing any database, IO management becomes a critical task. It’s something I feel most DBMSs have not addressed. Data placement is most of the time a manual and very time-consuming thing to do. On the rare occasions when we are forced to manage it because of a performance issue, the resulting work might only be a temporary solution, as the data access patterns will change over time. Most storage units can’t mix Fibre Channel and SATA drives in the same RAID volume, with the exception of Compellent and a few other storage array vendors.

It would be nice to see, in a DBMS or in the OS itself, the ability to spread the same data file over a mix of solid state, Fibre Channel and SATA drives. The software would then take care of migrating the frequently used pages to the faster drives and the infrequently used ones to the slower drives. This way the most expensive drives would be used to their fullest, while the slower and cheaper drives would contain most of the data. Pages could be moved in either direction, during idle time or scheduled maintenance, depending on the data access pattern detected by the software. As a rule of thumb, data residing in the buffer pool would most likely be placed on the faster drives.
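
To make the idea a bit more concrete, here is a minimal Python sketch of how such a placement pass might work. Everything in it is an assumption of mine for illustration: the tier names, the capacities (in pages), the per-page access counters and the function names are all hypothetical, and a real engine would need far more (I/O cost, aging of counters, concurrency).

```python
from collections import Counter

# Hypothetical tiers, fastest first, with capacities in pages (illustrative numbers).
TIERS = [("ssd", 2), ("fibre_channel", 3), ("sata", 100)]


def plan_placement(access_counts, tiers=TIERS):
    """Assign the most frequently accessed pages to the fastest tiers."""
    # Rank page ids hottest first; ties are broken by page id for determinism.
    ranked = sorted(access_counts, key=lambda page: (-access_counts[page], page))
    placement = {}
    start = 0
    for name, capacity in tiers:
        for page in ranked[start:start + capacity]:
            placement[page] = name
        start += capacity
    return placement


def plan_moves(current, target):
    """List (page, from_tier, to_tier) for every page whose tier should change."""
    return [
        (page, current.get(page), tier)
        for page, tier in sorted(target.items())
        if current.get(page) != tier
    ]


# Illustrative access counts collected since the last maintenance window.
accesses = Counter({1: 50, 2: 3, 3: 40, 4: 1, 5: 20, 6: 0})
current = {page: "sata" for page in accesses}  # everything starts on slow disks
target = plan_placement(accesses)
moves = plan_moves(current, target)
```

The list of moves would then be executed during idle time or scheduled maintenance, as described above.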

SQL Server Cached Result Sets

February 19, 2008

People familiar with SQL Server know that the database engine caches data pages in memory to give queries faster access. A series of algorithms governs their life expectancy. This allows the query processor to fulfill a wide range of requests by using frequently accessed pages. This usually works great for simple queries, but when we get to more complex requests, the performance gain diminishes quickly. This is because the pages have to be reprocessed to answer the needs of a particular query that might join tables, filter data, perform calculations, etc. While I was at TechEd a few years ago, I had a discussion with one of the program managers on SQL Server, in which I proposed the possible caching of query results.

The database engine could cache the outcome of queries. SQL Server could take care of tracking which result sets should be invalidated by tracking the lineage of each result set. This means that if a page of data is modified, the result sets that depend on it should get invalidated and their memory freed. Since developers try to minimize the amount of data returned to applications or reports, we could expect huge gains in performance for the users. Scenarios such as BI would greatly benefit from this, as the content queried by dashboards and reports could be delivered instantly without the need to resolve complex joins and aggregations, a little bit like the way Analysis Services performs aggregations and stores them in the cube. The persistence of the results could also be based on criteria similar to those used for regular data pages. In certain scenarios where the data is static, those results could even be persisted transparently to disk as the last stage before they get deleted completely from SQL Server’s cache.
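
Here is a small Python sketch of the lineage idea, just to show the bookkeeping involved. It is purely illustrative and not how SQL Server works: the `ResultCache` class, its method names and the use of the query text as the cache key are all my own assumptions.

```python
class ResultCache:
    """Toy result-set cache that tracks which data pages each result depends on."""

    def __init__(self):
        self._results = {}     # query text -> cached rows
        self._dependents = {}  # page id -> set of query texts that read that page

    def get(self, query):
        """Return the cached rows for a query, or None on a cache miss."""
        return self._results.get(query)

    def put(self, query, rows, pages_read):
        """Cache a result set together with its lineage (the pages it was built from)."""
        self._results[query] = rows
        for page in pages_read:
            self._dependents.setdefault(page, set()).add(query)

    def page_modified(self, page):
        """Invalidate every cached result that depends on the modified page."""
        invalidated = self._dependents.pop(page, set())
        for query in invalidated:
            # Stale lineage entries left under other pages are harmless:
            # at worst they cause an extra invalidation later on.
            self._results.pop(query, None)
        return invalidated


cache = ResultCache()
cache.put("SELECT ... FROM sales", [(1,)], pages_read={10, 11})
cache.put("SELECT ... FROM stock", [(2,)], pages_read={11, 12})
dropped = cache.page_modified(10)  # only the first query read page 10
```

After the call to `page_modified(10)`, the first result set is gone while the second one is still served from the cache.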

Another advantage is that the DBMS is closest to the data and is aware of the changes happening to it. If you combine traditional application-side caching techniques with a notification mechanism from the database engine, you could keep the application’s cache closely synchronized, potentially without any polling. Taking this one step further, once a result set is invalidated by an update to its underlying data, it could be reconstructed based on database policies, and a notification could be sent to the clients asking them to refresh that result set. Since result sets are based on the outcome of a particular query, the query handle in the plan cache could serve as the hook to a given result set. This way, the DBMS only has to figure out which query we’re trying to execute and deliver the result back to the client.

With this kind of functionality, it would be fairly easy to decouple it from the core DBMS and host it on a separate server dedicated to delivering those cached result sets. The job of the core DBMS could then revolve more around keeping those “cache servers” in sync and maintaining data integrity on persistent storage. If data pages could also be propagated from the core DBMS to the cache servers, each server could rebuild its result sets itself, thereby distributing the load across potentially multiple servers. Since only the changed pages would be propagated from the core to the cache servers, the traffic and load should be minimal. We could take this one step further and propagate data pages only to the servers with dependencies on that particular page.
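
The last point, sending a changed page only to the cache servers that depend on it, boils down to a set intersection. A rough Python sketch, where the server names, page ids and the `route_changed_pages` function are hypothetical and only there for illustration:

```python
def route_changed_pages(changed_pages, server_dependencies):
    """Return {server: pages to send}, skipping servers that need none of the changes."""
    changed = set(changed_pages)
    routing = {}
    for server, pages in server_dependencies.items():
        relevant = changed & set(pages)
        if relevant:
            routing[server] = relevant
    return routing


# Which data pages each (hypothetical) cache server built its result sets from.
dependencies = {
    "cache1": {10, 11, 12},
    "cache2": {12, 13},
    "cache3": {20},
}
routing = route_changed_pages([12, 13], dependencies)
```

Here `cache3` receives nothing at all, since none of its result sets were built from pages 12 or 13.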

I was hoping to see this in SQL Server 2008, but I guess it might be coming to SQL Server vNext! :-)

I’ve been using VMware ESX since version 2, and one of my favorite features is the combination of DRS (Distributed Resource Scheduler) and VMotion. After using it in production for a while, I noticed that DRS could be improved. One thing I saw is that the cluster was not balancing CPU and memory load properly across all nodes. I could have nodes using 80% of their memory with a decent CPU load while others were sitting at 40% memory usage, even with the most aggressive DRS settings.

What I would like to see in DRS is smarter load balancing. For instance, ESX could detect that a certain VM has high CPU or I/O usage every night and issue a VMotion accordingly before the peak usage. It would be nice to set up priorities for such types of loads. One example that comes to mind is SharePoint. You usually schedule document indexing during the night, and this is typically not a high-priority job, but during the day you want to be able to deliver good response times for users while they navigate and query the search engine. In that case you could define a time range that specifies the priority a certain VM has over others.

This way you could better load balance VMs across the cluster. For example, VMs that are mostly idle could be regrouped on a limited set of nodes, and the ones performing intensive operations distributed appropriately on the remaining nodes during that period. It would be interesting to see if data mining of the data in Virtual Center could discover load patterns and correlations between VMs to enhance DRS functionality.
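
As a thought experiment, here is what schedule-aware balancing could look like in a few lines of Python. This is not how DRS works; it is a simple greedy heuristic (heaviest VM first, always onto the least loaded node), and the VM names, node names and predicted loads are made-up numbers standing in for what data mining of the Virtual Center history might produce.

```python
import heapq


def place_vms(predicted_loads, nodes):
    """Greedy placement: heaviest VM first, each onto the currently least loaded node."""
    heap = [(0, node) for node in nodes]  # (total predicted load, node name)
    heapq.heapify(heap)
    placement = {}
    # Sort by descending load; ties are broken by VM name for determinism.
    for vm, load in sorted(predicted_loads.items(), key=lambda kv: (-kv[1], kv[0])):
        total, node = heapq.heappop(heap)
        placement[vm] = node
        heapq.heappush(heap, (total + load, node))
    return placement


# Hypothetical predicted CPU load (%) per VM for the upcoming night window.
night_loads = {"sharepoint-index": 80, "sql": 60, "web": 40, "idle1": 5}
placement = place_vms(night_loads, ["esx1", "esx2"])
```

Running the same function with the daytime predictions would give a different placement, and the difference between the two is the list of VMotions to issue before the peak.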

File duplicates finder

February 4, 2008

A month ago I built a small application, with a set of Microsoft SQL Server Reporting Services reports, that inventories the files under a specific path on a file system or in SharePoint and stores each file’s hash and location in a table. With this you can identify duplicates between a file server and SharePoint sites, as well as get storage statistics.

I’ll be publishing the solution on CodePlex at the following URL:

 http://www.codeplex.com/fdf

The project was developed in .NET 3.5 using technologies such as Windows Communication Foundation and Parallel FX for multi-threaded file hashing.

Please note that this is an early version and has not been fully tested. I ran both SharePoint and file system scans on about 130,000 files without apparent issues.
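
For readers who just want the gist of the technique, here is a short Python sketch of hash-based duplicate detection with multi-threaded hashing. It is not the project’s code (which is .NET and also covers SharePoint and SQL Server storage); the choice of SHA-256, the function names and the worker count are my own assumptions for illustration.

```python
import hashlib
import os
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def hash_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root, workers=4):
    """Map each content hash to the sorted list of files sharing it (duplicates only)."""
    paths = [
        os.path.join(folder, name)
        for folder, _, names in os.walk(root)
        for name in names
    ]
    # Hash files on several threads; pool.map keeps results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        digests = list(pool.map(hash_file, paths))
    by_hash = defaultdict(list)
    for path, digest in zip(paths, digests):
        by_hash[digest].append(path)
    return {h: sorted(found) for h, found in by_hash.items() if len(found) > 1}
```

Calling `find_duplicates` on a folder returns a dictionary in which every entry is a group of files with identical content.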

 Enjoy!

What Windows 7 should be

February 2, 2008

Here’s my take on some of the features Windows 7 should have:

Distributed processing

Now that we’re getting decent bandwidth in LANs and WANs, it would be nice if Microsoft could develop a layer in the operating system that would work similarly to VMware VMotion, but at the thread level. This means that if the local computing device, whether a Windows Mobile phone, laptop, desktop or server, is overloaded by the execution of another thread, the load could be offloaded on the fly to another node with available processing capacity. Once a new device is connected to a network, be it a private network or a public one like a Wi-Fi access point, the user could decide whether the device should participate in the processing. Applications could be built in such a way that they detect whether the computer is part of a larger computing pool and enable extra functionality for the end users. The distribution of thread execution could vary based on the bandwidth available and the job at hand. It could also be controlled via policies; for instance, engineering workstations could have execution priority over a gaming application in a corporate network.

It would be interesting to see libraries such as Parallel FX from Microsoft leverage distributed processing over multiple computers in a transparent fashion. They’re already abstracting job distribution over multiple local CPUs, and I don’t see any reason why that couldn’t be extended to remote processing of tasks as well.

It would be cool to have the operating system capable of routing certain specialized tasks to specific processors such as GPUs. Multiple PCs with graphics cards could, for instance, be used transparently as a rendering farm for thin clients.

I know distributed processing is not a new thing in academic and sometimes corporate settings, but imagine if a general-purpose operating system such as Windows could bring this mainstream. A pool of 100 million PCs sharing their workload to achieve what was previously unthinkable. PC virtualization is nice, but thread execution virtualization is the future!

Distributed storage

Wouldn’t it be nice to have access to a single pool of storage that is almost infinite in size and always available? If you look at one of my previous posts, Semantic Web, you can find in there an idea of how storage could evolve into something more collaborative and intelligent. Think of it as a giant hard drive with built-in redundancy and load balancing functionality.

Wrap up

If you combine those two technologies, imagine how that would change the world of computing. Near unlimited storage and processing available to each of us. This would also help in the area of green computing, as we wouldn’t have to build PCs with as much processing capability anymore, since we could always leverage the computing capability of our peers in a given network. You could plug in a PC that boots from a distributed storage pool and start up a data mining application that runs multiple threads over multiple local and remote CPUs. Neato!

Green computing

February 1, 2008

What should be our primary objective? To combine processing, build more energy efficient hardware or change user behaviors? If networking technology is the main barrier for centralized computing, should we focus our efforts in that direction instead of looking at cramming more transistors in handheld devices with poor user experiences and short battery life?

I think it’s a little bit like the automotive industry. Which one is the “greener” option? Public transportation or more fuel-efficient cars?

Just food for thought here!