SQL query optimization and testing
SQL query optimization At its simplest level the JOIN operator in SQL combines records from two or more tables in a relational database and returns the results in a new set. This is one of those algebraic set theory problems that people find easy to grasp, but difficult to fully understand all the implications of doing such a thing. It is, however, the KEY concept in relational databases. The way the SQL Engine (that would be VistaDB in this case) performs these join operations can vastly impact performance. This has been the focus of much of our optimization efforts over the past month. The ability to join result sets from tables in a more efficient manner. If all queries were built the same, it would be easy. The problem is that the predicate clause of the join (=, WHERE, IN, AND, OR, etc) vastly changes the way in which you can optimize these operations. You default (and slowest) implementation usually looks something like a series of selects statements (result sets) that are then compared one by one by looping through the tables and making sure that an item belongs in the final result set. This can take a huge amount of RAM for large result sets, and the number of rows included in the join syntax can impact the query exponentially. The more rows, you don’t just double the search space, sometimes you x^2 the space. VistaDB Build 27+ includes a new optimizer The new optimizer is not a general purpose optimizer; certain criteria have to be in place before it will attempt to optimize the query. For example if a query contains both a left and right join (they are legal, but you will almost NEVER get the results you think are coming back) the optimizer cannot optimize the operation. But when the criteria is right, it can improve performance by orders of magnitude. The basic idea is similar to a hash join. Rather than store all the data from each result set, you only store some special value that is used for the join operation. In our case we are attempting to build a bitmask. This is similar to a Z-Buffer in computer graphics. You build up a bitmask and XOR / AND the bitmasks together to come up with just the values you need to bother pulling from the disk. New optimizer, new problems The problem we have run into a few times since the new optimizer has been put in place is knowing when to perform the various modes of optimization. In some cases the optimization can optimize result sets too far and exclude valid results. This is why the intensive testing has been requested for Build 28 and 29. There are just so many types of query and optimization combinations that we could not possible test them all. We are trying to build generalized test cases, but the reality is that it would probably take thousands of test cases to get all the various ways join and join predicates can be combined. This is one of those examples why testing a database engine is a very difficult problem. Test framework needed We are continuing to grow our test base of NUnit and functional level tests. The problem in situations like this is that you really need a comparative test (something NUnit is not good at). We need to run a test join in the long, slow, tedious method and then an optimized method and compare the two result sets to verify they are identical. This is turning into a test framework, more than an NUnit style test. The level of tests required to reach code coverage of an application is high, but not nearly as high as attempting to test each objects interactions with each other as well. I don’t know what the correct answer is at the moment, but we will continue to work on it. I have a feeling we are going to end up with a testing framework for SQL. I have tried without success to find public test suites to do what we need. I am amazed that the open source SQL databases appear to have very little in the realm of these types of test frameworks. Most of them have hardcoded test suites of the type “When you perform the following operation on the following table, this is the answer”. That means you have to write all these tests, and fix them if one breaks. I would prefer something that can take a problem space, and automate the tests. My ideal solution would involve something that can run an NP complete type of test for a subset of results from the sql node space. But, if it were that easy… Everyone would do it. Thanks for your help and feedback I want to thank everyone for your incredible effort in testing the current development builds. It has been a fantastic effort from the entire community to really beat the engine up and make it work. My goal, as always, is to make VistaDB the best engine available on the market. I know we can do it, with your help. Thanks.
Similar Posts
- SQL Server 2008 (Katmai) Information
- Performance numbers and VistaDB
- Database Optimizations – an art, not a science – Dev Build 27
