GPU Accelerated Databases

February 27, 2008

I was reading about General Purpose GPU programming on http://www.gpgpu.org  and I started to think about how GPU could be leveraged in database technology. One use that came to my mind immediately was for geospatial and geometrical data. I’m far from being an expert in that matter, but I would think that one could offload most of the calculations to a GPU. Both raster and vectorial maps can benefit the use of a GPU since it can handle both bitmaps and vectorial data like a real champion.

 Another use that I think might work is for indexing data. If you represent data using geometrical patterns, it would be thinkable to use a GPU to perform pattern matching in a very efficient manner due the highly parallel nature of those processing units. If you combine those patterns with set theory, you could define patterns that encompass the actual data. By combining geometrical pattern, those indices would be able to determine the data that is to be included in queries involving a wide range of aggregations and computations.

I’d be curious to see with the CLR integration in SQL Server, if one could call DirectX libraries to offload some work to a GPU.

Feel free to comment on the post, I’m learning and thinking out loud here! I’ll probably be posting more thoughts about this in the coming posts. I’m already thinking about applications for data minining and OLAP data…

Here’s another feature that I would like to see in SQL Server. When it comes to managing any database, IO management becomes a critical task. It’s something I feel most DBMS have not addressed. Data placement is most of the time a manual and very time consuming thing to do. For the rare occasions that we are forced to manage it because of a performance issue,  the resulting work might only be a temporary solution as the data access paterns will change over time. Most storage units can’t mix fibre channel and SATA drives in the same RAID volume with the exception of Compellent and a few other storage array vendor.

 It would be nice to see in a DBMS or the OS itself, the ability to spread the same data file over a mix of solid state, fibre channel and SATA drives. The software will then take care of migrating the pages  which are used frequently to the faster drives and the ones used unfrequently to the slower drives. This way the most expensive drives would be used to their fullest while the slower and cheaper drives would contain most of the data. Pages could be moved in one direction or another, during idle time or scheduled maintenance, depending on the data access patern detected by the software. As a rule of thumb, data residing in the buffer pool would most likely be placed on the faster drives.

SQL Server Cached Result Sets

February 19, 2008

People familiar with SQL Server know that the database engine caches data pages in memory for faster access to queries. There are a series of algorithms governing their life expectancy. This allows the query processor to fulfill a wide range of requests by using frequently accessed pages. This usually works great for simple queries but when we get to more complex requests, the performance gain diminishes quickly. This is caused before the pages have to be reprocessed to answer the needs of a particular query that might join tables, filter data, perform calculations, etc. While I was in TechEd a few years ago, I had a discussion with one of the program manager on SQL Server where I proposed the possible caching of query results.

 The database engine could cache the outcome of queries. SQL Server could take care of tracking which result sets should be invalidated by tracking the lineage of the result set. This means that if a page of data is modified, the result set which are dependent on it should get invalidated and the memory freed. Since developers try to minimize the amount of data return to the application or reports, we could expect huge gains in performance for the users. Scenarios such as BI would greatly benefit from this, as the dashboards and reports query content could be delivered instantly without the need to resolve complex joins and aggregation; a little bit in the way Analysis Services performs aggregations and stores them in the cube. The persistence of the results could also be based on criteria similar to the one used by regular data pages. In certain scenarios where the data is static, those results could even be persisted transparently to disk as the last stage before it gets deleted completely from SQL Server’s cache.

 Another advantage this might have, is that the DBMS is the closest to the data and is aware of the changes happening on it. If you combine this with traditional caching techniques present in applications with a notification mechanism from the database engine, you could keep the application’s cache closely synchronized, without potentially any polling necessary. If we take this one step further, we could say that once a certain result set gets invalidated by an updated to its underlying data, it could get reconstructed based on database policies and a notifications could be sent to the clients asking them to refresh a certain result sets. Since the result sets are based on the outcome of a certain query, the query handle in the plan cache as its hook to a certain result sets, this way, the DBMS only has to figure what query we’re trying to execute and deliver the result back to the client.

Using this kind of functionality, it would be fairly easy to decouple this functionality from the core DBMS to have it hosted on a separate server dedicated to the delivery of those cached result sets. The job of the core DBMS could revolve more around keeping the cache synched those “cache servers” and maintaining data integrity on persistent storage. If data pages could also be propagated from the core DBMS to the cache servers, each server could rebuild their result sets itself, therefore distributing the load accross potentially multiple servers. Since only the pages changed would be propagated from the core to the cache servers, the traffic and load should be minimal. We  could take this one step further and only propagate the data pages to only servers with dependencies on that particular page.

I was hoping to see this in SQL Server 2008, but I guess it might be coming to SQL Server vNext! :-)