Algorithms – no one size fits all

written by Jason Short on Friday, July 11 2008

 
I have been spending quite a bit of time lately thinking about algorithms. Not always specific concrete ones like sorting items in a list. More in the general sense of how to determine the correct process to create a desired outcome.
 
Easy choices are... easy
 
There are some places where decisions are obvious. If you need to keep a list of all the indexes contained in a database, or what categories of data are available the choice is usually pretty easy. The choice is also usually minor in terms of the outcome. The route you took to reach the decision is not always important. As long as you get everything in alphabetical order no one cares how you did it.
 
Subtle choices are much harder
 
Subtle choices become much harder. Take as an example the caching of indexes from a database. How much of your precious ram and heap do you spend on that? That answer is almost always – it depends. What about things like when to store data in a cache rather than fetch it from disk or remote location? Those choices start becoming harder and harder.
 
How much is too much?
 
Now take into account things like the volume of data. Do you plan to store 1 million rows or 10? There are general purpose algorithms that work for both, but then you have optimized for neither case. What if optimizing for one case leads to a 100x improvement in performance for that specific case? Do you implement the optimization all of the time, or conditionally?
 
Caching and storing of data is another difficult question to answer in general purpose terms. What works great for a single user dedicated app will fail horribly under a severe multi user scenario. There are lots of cases where we in essence optimize for worst case scenarios – always assuming disaster is about to happen. What usually ends up happening is you kill performance for most of the general purpose cases just to handle that one really strange scenario.
 
General purpose means optimized for none
 
I have never been a big fan of general purpose systems. I usually build for specific constraints in the system. If I need to store a lookup of several million values I will implement it totally different that a quick look up in a list of less than ten. But where do you draw the line? 
With the database engine we have to handle all sorts of scenarios. Should the engine special purpose for both ends of the performance spectrum? Or do you maintain two code bases for such purposes? I can remember years ago using a data engine that you could load the large or small version of the library. 
 
I think in some ways SQL CE is doing this because you have to tell the engine the max size of the database. The engine can then in some ways cheat to optimize what it expects you to do in that case. Tables that would have been kept in RAM or cached will now be spooled to disk or partitioned.
 
Interesting to think about
 
I don’t have any answers, but I have been spending a lot of time thinking about it lately for VistaDB and other projects I am working on. It is always difficult to know when you have struck the right balance.
 
Optimization for your target means having a valid sample to optimize against.  But in programming, as with life, it is not always so neat or orderly. 
 
 
 

Similar Posts

  1. New Feature Frenzy?
  2. The GC does not solve all memory leaks
  3. VistaDB Customer Survey Results Fall 2007

Comments are closed

Options:

Size

Colors