Level Extreme platform
Subscription
Corporate profile
Products & Services
Support
Legal
Français
Articles
Search: 

Using Full Text Index
Oscar Zarate, March 1, 2003
Handling large text volumes is always a complication for systems, and mainly for ourselves, developers. When we have the need to handle fields with vast amounts of text our tribulations begin, and this is even worst if the user needs to search for a single word; of course not the first word of a phr...
Handling large text volumes is always a complication for systems, and mainly for ourselves, developers. When we have the need to handle fields with vast amounts of text our tribulations begin, and this is even worst if the user needs to search for a single word; of course not the first word of a phrase, but one that's everywhere within. Here's where triumphantly enter the products that allow for index building over those fields.

Using Full Text Index with MS SQL Server

A clear example where we find this need is a library system, whe we have a table with book data and that table has a field with a short overview (70 words) of the book content, and where we must provide the ability to search for a word within this "short" description.

It is true that there are statements as LIKE and seach wildcards, but, have you sometime tried to perform a search for a word in a 20,000 records table over a 2,100 characters field with LIKE?

Alright, if you want to perform a search like the later and get a reasonable response performance, you would need a specific index, not the classical ones but a special one, which is managed by Full Text Index (FTI) under Microsoft SQL Server (SQL).

I'll tell you that I did a real test, in a table with 2 million records with a 75 characters field. Performing a search with LIKE returning a single record with the corresponding quantity (SELECT COUNT) takes 1’ 22” (a minute and 22 seconds) and the same query trough an FTI index takes 5” the first time and 3” the following. Not bad.

Full Text Search is the free text indexing service available in Windows and can be used to search over HTML pages among other things. This service is integrated into Microsoft SQL Server 2000 (it is installed toghether with the database engine) and with Microsoft SQL Server 7.0 (although you have to ask for it to be installed, as in this case is not a default option). This is the service that solves our problem.

Figure 1: SQL Server Componentes (Microsoft SQL Server Books OnLine)

After installing FTI you'll see an added item called Microsoft Search (in the services combo) for the SQL Service Manager (the icon allocated on the system tray). There you could see that this service runs as a process separated from the SQL, and it can be independently running or stopped. That means that FTI will be STARTed or STOPed no matter if SQL is STARTed. Simmilarly like with SQL, we could set the service to start as the machine boots up.

As I mentioned earlier, the service runs independently, as well as the index management (even as it is very integrated). That means that when you want to use an FTI index you have to enable it from a special option.

Each database table can have a related FTI file, having into this index one or mor indexed fields. This index is known as a CATALOG.

Inlike a regular SQL index, when you finish a catalog creation, this isn't ready to be used, but it needs to be populated. This is a separatedly triggered process that makes the index to be loaded with the corresponing content.

Let's see a simple example: in this case we'll create a catalog within the PUBS database that comes along with SQL, over the TITLES table.

Figure 2: Creating a catalog for a database

From the Enterprise Manager, we right-click over the table and choose Full-Text Index Table and within this option, Define Full-Text Indexing on a Table..., which is the only one enabled. That would open a Wizard to help create our catalog.

The steps are:

  1. Select the Unique index.

  2. Select the fields to index on. In this example, the TITLE field.

  1. Select the physical filename, containing the catalog and pathname. In this example case the location will be the default one and the calatgo name will be TITULOS.

  2. The we are presented with the option to create a Scheduled Task to populate the catalog. In our case we'll skip this to manually populate our catalog.

  3. Finally we confirm the catalog creation.

Very well. After these steps we have an empty catalog created. As the following procedure we will populate the catalog.

From the Enterprise Manager, right-click over the table and select Full-Text Index Table and within this option, Start Full Population. That will display a window stating that the process has being initiated.

Figure 3: Populating a catalog

As I mentioned earlier, this runs in a separated process, so we wouldn't get any warning when it's finished. In this case, as it is a small table (it comes with just 18 records by default) the time it will take is very short, but in bigger tables this time can raise exponentially.

To know about the process status, click over the Full-Text Catalogs folder on the PUBS database. There you'll see that a record over the newly created catalog exists, and if you look at the column corresponding to Status it would be "Idle" when the process is finished, or "Population in Progress" during its running.

Finally, let's make a test in Query Analizer, where we'll write:

select * from titles where contains(*, 'silicon')

we'll see that it returns two records corresponding to two books in whose title the word Silicon appears.

Here ends the introduction to the usage of Full Text Index. We saw an brief explanation over its components and how to create, populate and use text indexes. From here I'll explain how to use all this on real-world systems, FTI index advanced administration, and the pros and cons of this working method.

NOTE: During my long years working with FoxPro I have to solve this topic with DBF files, and I did witha product called phDBase que sadly seems to be out of market. I say sadly because the product wrorks really good with DBF files and its usage is nearly transparent. We should only add this library and lear some simple predicates to index MEMO fields. There are currently other products that fill this gap, but I can't give you any references as I didn't used any of them.

Using Full Text Index in real life

Even when the later example es completely real, this is not a case in which using FTI is unavoidable, as a query using LIKE could have responded quiet good as well (not very efficiency, but it would do it), but when we have tables with many records and several fields, its usage turns essential.

Before delving more in the use of FTI, it would be good to talk about the Microsoft recommendations about the usage of this service.

Microsoft recommnds to keep the catalog physical files in another disc, different from the one that holds the databases. They also recommend that the tables don't surpass a million records. They also make a point for powerful machines with plenty of RAM, multiple processors and a lot of free disc space. Accompliching this recommendations FTI really flies high, although not every installation can have such a good dimensioned equipment. However, I can assure you that with normal machines (the type needed for a server) results are actually excellent.

Going deeper on the use of FTI

The catalog management can be done from the Enterprise Manager (see the previous example) or from the Query Analizer.

As we saw in the example, the first step to generate the catalog was its creation. We did this from the Enterprise Manager, with right-click over the table and selecting Full-Text Index Table and then, Define Full-Text Indexing on a Table... and following the Wizard. To perform the same operation from the Query Analizer, we do -over the database we want to index (for instance, PUBS):

-- Enable database to work with FTI
   sp_fulltext_database N'enable'

-- With this statement we get the database status respect to FTI
   select DATABASEPROPERTY(DB_NAME(), N'IsFullTextEnabled')
-- 0 => disabled and 1=> enabled for FTI.

-- Invoking this Stored Procedure, we create a catalog called TITULOS .
   exec sp_fulltext_catalog N'TITULOS', N'create'

-- Invoking this Stored Procedure, we select the primary key
   sp_fulltext_table N'[dbo].[titles]', N'create', N'TITULOS', N'UPKCL_titleidind'
-- It is essential to have a primary key.

-- Invoking this Stored Procedure, we select the field to index on and the language.
   sp_fulltext_column N'[dbo].[titles]', N'title', N'add', 3082
-- Valor 3082 => Spanish Modern (0x0c0a), You can use others as 0 for Neutral, or 1033 
-- for American English (0x0409) or 2057 for UK English (0x0809), etc.

-- Finally, invoking this Stored Procedure, we activate the catalog.
   sp_fulltext_table N'[dbo].[titles]', N'activate'

The following step is to populate the catalog with:

-- Invoking this Stored Procedure, we populate the catalog.
   sp_fulltext_catalog N'TITULOS', N’start_full’

With the following statement we can know about the catalog status.

   Select FullTextCatalogProperty(N’TITULOS’, N 'PopulateStatus')
-- It returns 0 => if Idle and 1=> if the process is running, among other values.

Populating catalogs

En the previous examples -the one made from the Enterprise Manager as well as the one from the Query Analizer-, we did Full Population and ignored the Incremental Population option. this is because the first time que populate a catalog it is the same to ask for a Full Population than an Incremental Population. The real purpose of the Incremental option is to keep adding records to the table and that those get reflected in the catalog (remember that FTI works separatedly from the database engine and the action of adding records to the table doesn't implies that they are on the FTI catalog).

a catalog's Incremental Population works only with the records added to a table if it has a TimeStamp field. This is used internally for the engine to know what record it has to add. Withouth this field presence the population would be complete even if an incremental would be better.

T-SQL instructions for FTI

The main instructions (they are actually predicates, as we normally use them as part of a SELECT statement) to perform searches over an FTI catalog are:

InstructionUsage
CONTAINS It is used to search in columns with text data having exacts or approximated matches, with exact words or certainly close. It can be used with a word, a phrase, a word or phrase prefix, a word close to another, etc.
FREETEXT It is used to search on columns with text data matching with the meaning and not the literal content. FREETEXT queries have less precision that the ones performed with CONTAINS.
CONTAINSTABLE Returns a table with one or mor rows for those columns that contains character data in which exact or approximated matches have being found for single words or phrases, words close to a given others (within a given distance), etc.
FREETEXTTABLE Returns a table with zero, one or more rows whose columns contain character data whose values matches the meaning, although not literally, of the specified text.

Common word list

In every catalog there are common words that are not desired to be indexed. SQL provides a language file (NOISE.ESN, NOISE.ENG, NOISE.ENU, etc.) with a list of common words, located in the "\MSSQL\FTData\SQLServer\Config" folder. These files can be edited and this would be take in consideration at the moment of catalog population. If we perform a search including just a word included in this file SQL would answer with the message: A clause of the query contained only ignored words.

Usage recommendations

Microsoft recommends not to surpass a million records for each table, but my personal experience is that with two millions it works very good, and even also with 4 millions, although it must be taken in consideration that population times are high.

But, what happens when we have tables with more records than that? In my case, I have to say that I had to build systems handling 20 millions records and I know some cases that handled about 50 million records. The solution was to divide the table horizontally in several tables and performing parallel searches.

This is indeed essential when working with SQL 7.0 as if we ask for the population of a catalog with a higher number of columns, the process would never end. This is solved in the 2000 version, but the population times for a catalog of 2 million records is about 6 hours in a dual-processor machine, 1024 Mb of RAM and SCSI discs, and about 14 hours in a single procesor, 512 Mb of RAM, IDE discs machine. I did test populating a table with 15 million records (over SQL 2000) and it took about 54 hours.

Conclusion

If you sometime need to work indexing huge volumes of text over SQL Server tables, you'll have available a really powerful tool. Full Text Index responds amazingly to complex queries and it is versatile enough working in not so powerful machines. The only caution you have to take in consideration is when you have a terribly big record set. You have to allocate a high initial time and maybe you have to think in break your tables horizontally to get the best results.

Oscar Zarate, National Australia Bank
Oscar Zárate is Bachellor in Business Administration, and Systems Analyst. He works as a developer since 1986, starting with the first versions of FoxBase, through Clipper, until the current versions of Visual FoxPro. He is MCP in Visual Basic .NET. He has also worked, among other tools, with Visual Basic, ASP, XML and SQL Server. He was the Treasurer of a the Microsoft User Group community (www.mug.org.ar). Currently, he is living in Australia after 36 years living in Argentina and now he is working for National Australia Bank after working more than one year for Triveni Infotech in Melbourne. www.oscarzarate.com.ar
More articles from this author
Oscar Zarate, June 1, 2003
A few years ago Microsoft launched a contest to award research projects among universities. On that topic, we interviewed Carlos Alejandro Pérez, Director of the project in the National Technological University (UTN), branch of the province of Chaco (Argentina). His project was selected from a...
Oscar Zarate, March 1, 2007
Since the time we program with object orientation, we can play at being all-powerful, deciding the lifetime of our objects, and limiting "what they can, and what they can't, do". Now, in some cases these objects seem to take on a life of their own, and don't respect our wishes. As the old saying goe...
Oscar Zarate, September 1, 2003
On August 7th and 8th took place at the Village Recoleta Complex this international event whose motto was "The more you know, the farther you reach". There were two days where the top speakers of Argentina gathered to bring out a true mega-event. Even more, the audience could take advantage of...
Oscar Zarate, February 1, 2003
Introduction Visual SourceSafe (VSS) is the version control (VCS) and source-code management system provided by the Microsoft Visual Studio product family. A version control system basically lets you manage the source programs within a development group, maintaining the entire modificati...