Strongly typed performance gains

written by Jason Short on Thursday, April 17 2008

I have mentioned to a few people that we are working on optimizations and performance internal to the engine.  I wanted to take a brief blog post to explain an optimization concept that is subtle, but makes a big difference in runtime performance.

Hashtables

Internally quite a few structures are stored as Hashtables for performance of lookup.  This is often not optimal because Hashtables do not understand any native types.  A cast must occur from the key type to the internal storage of the Hashtable.  This seems like a small thing, but it adds up quickly.

Converting a Hashtabel to a strongly typed Dictionary<string,int> can have a massive impact on performance with almost no change in memory footprint.

List Merger

Originally we discovered this tip from a tool that merges down lists of large text files for the Spam Shield.  This tool used a Hashtable and kept lists of domains to be removed or duplicated.  During one particularly long and ugly merge (we were literally waiting 12 hours to get a complete merge) we decided to look at other data structures to store this data.

The original code used a Hashtable MergeThese() - We tried to preallocate the size of the array and it did help performance a bit.  Then we noticed that the majority of the time was actually being spent in the .Contains() method for the Hashtable.

We replaced the Hashtable with a Dictionary<string, int>.  You really don't need the int at all (we leave it at 0), but because the Dictionary is now strongly typed (thanks to generics) your lookups are much faster.

This code in the old system:

if( mergeAddThese.Contains( temp ) == false )

turned into:

if( mergeAddThese.ContainsKey( temp ) == false )

Trivial code change - exact same logic.  The Add was similar.  The performance blew our socks off.

Input data 500MB of domain data that needed to be merged, sorted, reduced, normalized, etc.

Original Merge - Peak memory

9 h 26 m - 287 MB

Dictionary Merge - Peak memory

1 h 57 m - 269 MB

Basically a 300% performance gain by changing the way a key was looked up in a list.  We tried ArrayLists, and other objects, but the templated Dictionary to a string worked the best by far.

Stop Words

Stop Words are the list of words used during Full Text Search indexes for skipping the word.  Every word in your Text or VarChar field has to be checked against this list to ensure we are not indexing words of very little benefit to the FTS lookups.  Words like "a, an, the, there, etc" are all pretty common for skipping.  We have a little over 540 words for the English language (See the help file for the list).

These Stop Words were being stored in a Case Insensitive Hashtable.  When we compare the words, we have to ensure that capitalization does not allow the word to skip.  You might think of performing a .ToLower() on all the strings or some other solution.  When you find yourself thinking of a common problem look in the Framework.  9/10 times Microsoft has your solution.  It is called the StringComparer.CurrentCultureIgnoreCase Comparer for string.  It takes advantage of the current culture and it is very fast (much faster than lowering your strings actually).

I took the original Hashtable based Stop Words class and tested how much time was taken to build an index on a sample database.  The database has around 1000 strings in it of small length.  Then I rebuilt the class to use a strongly typed Dictionary and ran the same tests.

Hashtable implementation

Loading Hashtable 471.6495 ms
Build 1 col Index  19.53 ms

Dictionary implementation

Loading Dictionary 2.9292 ms
Build 1 col Index  4.8825 ms

The load only happens once per application database.  So it is not a big deal.  But it is still nice to shave over 1/2 a second off load time.  The real performance was in the FTS Build of the index.  For this relatively small index is was 400% faster.  I can only imagine the savings as the size of the index goes up.

Optimization

This optimization is in Build 57.  We have several more classes that use similar constructs to this that we plan to recode for performance like this.  I don't expect a miracle on the overall engine performance, but every little bit helps.  And we may find some larger performance bottlenecks while doing this test.

There are also some newer Disk I/O constructs we can use for better performance that I want to try within the DataStorage layer.  I will post my findings as we go through this process.  I feel like now is the time for us to start looking at many of the internal object classes for performance, temp object allocation, casting, and other practices that rob your application of performance.  It should be interesting.


kick it on DotNetKicks.com

Similar Posts

  1. The GC does not solve all memory leaks
  2. SQL Server 2008 (Katmai) Information
  3. New Feature Frenzy?

Comments

  • dpigford on on 4.17.2008 at 10:31 AM

    dpigford avatar

    Jason,

    Thanks for sharing the tip. After reading this post I tried the same approach (switching from HashTable to Dictionary) in one of our applications. You findings seem to be correct; it is a huge runtime gain. Armed with the NUnit tests, VistaDB is in a great position to dial in on some of these specific issues.

    -DP

  • OmariO on on 4.17.2008 at 11:44 AM

    OmariO avatar

    Have you tried HashSet<T>? It stores only keys. msdn2.microsoft.com/.../bb359438.aspx

    Useful links regarding performance: http://blogs.msdn.com/ricom/ http://blogs.msdn.com/vancem/default.aspx blogs.msdn.com/.../default.aspx http://blogs.msdn.com/profiler/ forums.microsoft.com/.../ShowPost.aspx

  • Jason Short on on 4.17.2008 at 12:14 PM

    Jason Short avatar

    Actually I did not try HashSet<> - didn't think about it. I will have to give that a try and see how it performs, thanks!

  • Jason Short on on 4.17.2008 at 1:31 PM

    Jason Short avatar

    It appears that HashSet<> is only supported in 3.5 - didn't know that. We still have to support 2.0 and CF 2.0. Do you know of another alternative that works with 2.0?

  • Jason Short on on 4.17.2008 at 6:11 PM

    Jason Short avatar

    As a follow up on this, I took some of the other methods that were using a hashtable and replaced them with strongly typed objects and they resulted in similar gains.

    internal class FunctionCollection : Dictionary<string, FunctionDescr>

    Now the FunctionCollection is strongly typed and can avoid the previous cast from object to a FunctionDescr class. This resulted in an improved speed for the parser. Every little bit helps, and free performance gains like this are always very welcome.

  • Hans Nieuwenhuis on on 4.18.2008 at 11:34 PM

    Hans Nieuwenhuis avatar

    Very nice to know that VistaDB will become not only better but also faster with every build! Just got build 57, but I'm already looking forward to build 58. :-)

  • logicdev on on 4.20.2008 at 2:10 PM

    logicdev avatar

    Just a crazy thought but would the following (if implemented in VistaDB obviously) improve performance much?

    vcmd.Parameters.Add( new VistaDBParameter<string>("MyCol",VistaDBType.VarChar));

    The question comes down to can Parameters be strongly typed by using generics, and if so would it improve performance.

    Another approach would be where the VistaDBParameterCollection.Add method internally constructs a VistaDBParameter<> object based on the VistaDBType that was passed.

    One result would be a strongly typed VistaDBParameter.Value property. Anyway, just a thought.

  • Jason Short on on 4.20.2008 at 7:32 PM

    Jason Short avatar

    I really want to make the parameters strongly typed. It will require a lot of changes though to the current interface (breaking changes too). It is something we are looking at though due to the dramatic performance difference. Right now all the Parameters are treated as just objects, and then cast.

    Hans, never worry - VistaDB will continue to get better over time. We have a lot of ideas and energy for future performance.

  • Hans Nieuwenhuis on on 4.22.2008 at 1:37 AM

    Hans Nieuwenhuis avatar

    Jason,

    I'm totally not worried...I have a lot of trust in your company / developers!

    For me it's already good enough, just nice to know that it will continue to become better.

Comments are closed

Options:

Size

Colors