Full Text Search

written by Jason Short on Monday, May 14 2007

Full Text Search This week’s build 25 includes for the first time Full Text Search capabilities.  I wanted to explain the how and why behind Full Text Search, and some of the limitations as well.  This initial release should be considered just that, an initial release.  We will be continuing to add much more to this in the coming weeks and months. Understanding the intent and purpose of full text search indexes will improve how you use and build them into your applications. Full text search is a very simple concept, but can have a very large impact on database performance. There are a number of techniques used in Computer Science for building and maintaining indexes on full text search terms. The following is not intended to show how we work, but rather give an example of the concepts. The following is a simple example of how a full text index might be created for the following database rows.

DocumentID Text
1 Pease porridge hot, pease porridge cold,
2 Pease porridge in the pot
3 nine days old
4 some like it hot, some like it cold,
5 some like it in the pot
6 nine days old
A full text search would first split each word and build an index that might look like this:

IndexID Word DocumentIDs
1 cold 1,4
2 days 3,6
3 hot 1,4
4 in 2,5
5 it 4,5
6 like 4,5
7 nine 3,6
8 old 3,6
9 pease 1,2
10 porridge 1,2
11 pot 2,5
12 some 4,5
13 the 2,5
Performing a lookup for CONTAINS(word, 'the AND some') would return documentID 5 since it is the only documentid that contains both words. 'the OR some' would return 2,4,5 since they are contained in all three documentids. As you can see building an index over large quantities of text can generate very large indexes. Compare LIKE to full text search Comparing the lookup paths between a full text search and a LIKE predicate in the above scenario consider the following: CONTAINS(word, 'the AND some') - generates 2 index scans in the FTS index LIKE IN ('the', 'some') - generates 9 row loads, then performs 18 string index operations. Nothing is free So as you can see the CONTAINS operator can be considerably faster than performing a LIKE.  This speed does not come for free.  During the insert / update / delete the textual columns for the FTS Index must be parsed, tokenized and inserted into the index.  For data that changes rapidly this can be a huge problem.  FTS Indexes are generally only used on tables where the change rate is fairly low, but the number of lookups is high. Many Enterprise SQL applications provide a service to maintain the full text indexes.  This is one of the primary reasons FTS is considered an Enterprise Application.  It is expensive to perform the operations in real-time, so most SQL providers simply tag the rows as requiring update and then allow the background service to periodically update the FTS indexes.  SQL Server Express is a notable exception in that Microsoft does not give you any agent to update the index, it appears to be performed through a trigger on the table.  I find this odd, since all other versions of SQL Server include the SQL Agent to update and maintain the indexes.  Of course SQL Express does not ship with FTS services, you must download the advanced services pack.  My testing of SQL Express with FTS and SQL Server Workgroup Edition found my test database FTS indexes were 4x slower in Express. Full text catalogs VistaDB does not use a separate catalog file for the FTS Index.  We keep all the FTS data with the database for easier XCopy deployment.  SQL Server and others keep their FTS indexes in separate directories called catalogs that usually include their word count and index files. I am considering adding an additional component to VistaDB that acts as the Agent from other systems.  This Agent would then have the ability to rebuild or defer FTS updates to a scheduled system.  I need to think about the application of this some more, but I think it may seriously help boost performance of the applications that use FTS heavily. Future Changes In the future we plan to add the ability for you to override how words are stemmed, and tokenized through CLR Procs.  You may have a specific language or content that needs a special parser, this will give you the ability to override our default system and extend it for specific needs. Additional Reading Managing Gigabytes – Witten, Moffat, Bell, Second Edition I highly recommend this book for anyone who is interested in the low level science of parsing, stemming, and building index systems to handle large scale systems.  It is very math and science heavy, with a lot of low level explanations.

Similar Posts

  1. SQL Server 2008 (Katmai) Information
  2. VistaDB Customer Survey Results Fall 2007
  3. devLINK 2007 – post show analysis

Comments are closed

Options:

Size

Colors