Strongly typed performance gains
I have mentioned to a few people that we are working on optimizations and performance internal to the engine. I wanted to take a brief blog post to explain an optimization concept that is subtle, but makes a big difference in runtime performance.
Hashtables
Internally quite a few structures are stored as Hashtables for performance of lookup. This is often not optimal because Hashtables do not understand any native types. A cast must occur from the key type to the internal storage of the Hashtable. This seems like a small thing, but it adds up quickly.
Converting a Hashtabel to a strongly typed Dictionary<string,int> can have a massive impact on performance with almost no change in memory footprint.
List Merger
Originally we discovered this tip from a tool that merges down lists of large text files for the Spam Shield. This tool used a Hashtable and kept lists of domains to be removed or duplicated. During one particularly long and ugly merge (we were literally waiting 12 hours to get a complete merge) we decided to look at other data structures to store this data.
The original code used a Hashtable MergeThese() - We tried to preallocate the size of the array and it did help performance a bit. Then we noticed that the majority of the time was actually being spent in the .Contains() method for the Hashtable.
We replaced the Hashtable with a Dictionary<string, int>. You really don't need the int at all (we leave it at 0), but because the Dictionary is now strongly typed (thanks to generics) your lookups are much faster.
This code in the old system:
if( mergeAddThese.Contains( temp ) == false )
turned into:
if( mergeAddThese.ContainsKey( temp ) == false )
Trivial code change - exact same logic. The Add was similar. The performance blew our socks off.
Input data 500MB of domain data that needed to be merged, sorted, reduced, normalized, etc.
Original Merge - Peak memory
9 h 26 m - 287 MB
Dictionary Merge - Peak memory
1 h 57 m - 269 MB
Basically a 300% performance gain by changing the way a key was looked up in a list. We tried ArrayLists, and other objects, but the templated Dictionary to a string worked the best by far.
Stop Words
Stop Words are the list of words used during Full Text Search indexes for skipping the word. Every word in your Text or VarChar field has to be checked against this list to ensure we are not indexing words of very little benefit to the FTS lookups. Words like "a, an, the, there, etc" are all pretty common for skipping. We have a little over 540 words for the English language (See the help file for the list).
These Stop Words were being stored in a Case Insensitive Hashtable. When we compare the words, we have to ensure that capitalization does not allow the word to skip. You might think of performing a .ToLower() on all the strings or some other solution. When you find yourself thinking of a common problem look in the Framework. 9/10 times Microsoft has your solution. It is called the StringComparer.CurrentCultureIgnoreCase Comparer for string. It takes advantage of the current culture and it is very fast (much faster than lowering your strings actually).
I took the original Hashtable based Stop Words class and tested how much time was taken to build an index on a sample database. The database has around 1000 strings in it of small length. Then I rebuilt the class to use a strongly typed Dictionary and ran the same tests.
Hashtable implementation
Loading Hashtable 471.6495 ms
Build 1 col Index 19.53 ms
Dictionary implementation
Loading Dictionary 2.9292 ms
Build 1 col Index 4.8825 ms
The load only happens once per application database. So it is not a big deal. But it is still nice to shave over 1/2 a second off load time. The real performance was in the FTS Build of the index. For this relatively small index is was 400% faster. I can only imagine the savings as the size of the index goes up.
Optimization
This optimization is in Build 57. We have several more classes that use similar constructs to this that we plan to recode for performance like this. I don't expect a miracle on the overall engine performance, but every little bit helps. And we may find some larger performance bottlenecks while doing this test.
There are also some newer Disk I/O constructs we can use for better performance that I want to try within the DataStorage layer. I will post my findings as we go through this process. I feel like now is the time for us to start looking at many of the internal object classes for performance, temp object allocation, casting, and other practices that rob your application of performance. It should be interesting.

Comments
dpigford on on 4.17.2008 at 10:31 AM
Jason,
Thanks for sharing the tip. After reading this post I tried the same approach (switching from HashTable to Dictionary) in one of our applications. You findings seem to be correct; it is a huge runtime gain. Armed with the NUnit tests, VistaDB is in a great position to dial in on some of these specific issues.
-DP
OmariO on on 4.17.2008 at 11:44 AM
Have you tried HashSet<T>? It stores only keys. msdn2.microsoft.com/.../bb359438.aspx
Useful links regarding performance: http://blogs.msdn.com/ricom/ http://blogs.msdn.com/vancem/default.aspx blogs.msdn.com/.../default.aspx http://blogs.msdn.com/profiler/ forums.microsoft.com/.../ShowPost.aspx
Jason Short on on 4.17.2008 at 12:14 PM
Actually I did not try HashSet<> - didn't think about it. I will have to give that a try and see how it performs, thanks!
Jason Short on on 4.17.2008 at 1:31 PM
It appears that HashSet<> is only supported in 3.5 - didn't know that. We still have to support 2.0 and CF 2.0. Do you know of another alternative that works with 2.0?
Jason Short on on 4.17.2008 at 6:11 PM
As a follow up on this, I took some of the other methods that were using a hashtable and replaced them with strongly typed objects and they resulted in similar gains.
internal class FunctionCollection : Dictionary<string, FunctionDescr>
Now the FunctionCollection is strongly typed and can avoid the previous cast from object to a FunctionDescr class. This resulted in an improved speed for the parser. Every little bit helps, and free performance gains like this are always very welcome.
Hans Nieuwenhuis on on 4.18.2008 at 11:34 PM
Very nice to know that VistaDB will become not only better but also faster with every build! Just got build 57, but I'm already looking forward to build 58. :-)
logicdev on on 4.20.2008 at 2:10 PM
Just a crazy thought but would the following (if implemented in VistaDB obviously) improve performance much?
vcmd.Parameters.Add( new VistaDBParameter<string>("MyCol",VistaDBType.VarChar));
The question comes down to can Parameters be strongly typed by using generics, and if so would it improve performance.
Another approach would be where the VistaDBParameterCollection.Add method internally constructs a VistaDBParameter<> object based on the VistaDBType that was passed.
One result would be a strongly typed VistaDBParameter.Value property. Anyway, just a thought.
Jason Short on on 4.20.2008 at 7:32 PM
I really want to make the parameters strongly typed. It will require a lot of changes though to the current interface (breaking changes too). It is something we are looking at though due to the dramatic performance difference. Right now all the Parameters are treated as just objects, and then cast.
Hans, never worry - VistaDB will continue to get better over time. We have a lot of ideas and energy for future performance.
Hans Nieuwenhuis on on 4.22.2008 at 1:37 AM
Jason,
I'm totally not worried...I have a lot of trust in your company / developers!
For me it's already good enough, just nice to know that it will continue to become better.