Visual FoxPro and Rushmore optimization
Carlos A. Pérez, April 1, 2002

The optimization

Many pages have been written about Rushmore, FoxPro's distinctive feature. Many programmers, even without in-depth knowledge, guess that Rushmore is something related to indexes, something that speeds up queries.

Well then... what exactly is Rushmore optimization? We can state - for now - that it is the use of indexes to accelerate data retrieval. But stated this way, plenty of questions remain unanswered.

To understand the underlying mechanism, let's first review some theoretical background, which will surely be welcome. Some paragraphs may be a little difficult to grasp at first glance, but the final concepts can be easily learned and remembered. Please don't expect every point about Rushmore to be covered, because Microsoft - like any other software company - does not publish much detail about its internal architecture.

Storage Hierarchy

Storage hierarchy refers to the layers of storage that exist in a computer to hold information. In the database world, the hierarchy is as follows:

  1. Physical RAM. Small capacity (less than 2 GB), great speed (about 10 ns). Volatile storage. Random access.
  2. Random-access magnetic media - normally a hard disk. Intermediate capacity (about 50 GB), medium speed (about 7 ms). Non-volatile storage.
  3. Sequential-access magnetic media. Great capacity (hundreds of gigabytes), slow speed (from one to several minutes). Non-volatile storage.

Thus, we can affirm that databases normally operate at layers 1 and 2. PROCESSING takes place in layer 1; STORAGE takes place in layer 2. We can easily infer that every process needs data to be transferred from layer 2 to layer 1, and vice versa. We'll come back to these concepts soon.

Relational model

At the beginning of the seventies, because every information system was proprietary and data was stored in proprietary formats, many hours of programming used to be consumed just to alter a system's functionality. There was no clear separation between the program itself and the data.

What was needed was so-called "data independence". A program should be protected against changes that may occur in the data structure or organization. The program is still needed, but beside it a new concept emerged, separated from the program itself: the database. The database provides physical independence, because regardless of how the data is stored, the page size, or the storage device, the program continues to "see" the data the same way. It also provides logical independence, because adding new columns or stored procedures should not interfere with the program being used (of course, this has limits; for instance, if we drop columns or entire tables, this no longer holds).

Until that time, COBOL processes used to operate against text-like files; some files used special field delimiters, others used addressing mechanisms to indicate the beginning and end of given fields, and so on. Each of these approaches had its own limitations as well.

Well now... what is a "model"? It is a set of concepts used to describe data. And why "relational"? Because it is based on set theory, where the term "relation" refers to an association among two or more elements of a set. In other words... a table.

The relational model is based on tables, which are called "relations". This should be kept very clear: tables are relations, they are synonyms, and the terms can be used interchangeably. Each table is made of rows and columns. Repeat to yourself: tables are relations.

In English, there is no confusion between "relation" and "relationship". In other languages, especially in Spanish, the term for relation and relationship is, unfortunately, the same (relación). This forced the Spanish-speaking community to use other terms to avoid the fatal confusion; the chosen term for relationship was "vinculación". This led to several glitches, and a myriad of theoretical discussions about semantics were raised. For us, the "relation" is the table, and the relationship is the association between tables - the one we could make by using SET RELATION. The relational model takes its name from the term "relation", that is, "table".

This model forces each table to have a key. What is a key? It is the minimum set of columns of a given table that uniquely identifies a given row. For instance, the table EMPLOYEES needs a key, and the SSN (social security number) could easily be adopted. In ARTICLES, a key could be (group_code, article_code) - a compound key - and so on. Please note that "minimum set of columns" must be respected. In our EMPLOYEES table, the SSN and NAME columns together also uniquely identify a given row, but SSN+NAME is not the minimum set; SSN alone already meets the requirement. SSN+NAME is called a "superkey" (that is, not the minimum set). In the relational model, such superkeys should be avoided.
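
As a concrete illustration, here is a minimal VFP sketch (table and column names are hypothetical); note that VFP only accepts the PRIMARY KEY clause for tables that belong to a database container:

CREATE DATABASE company
CREATE TABLE employees (ssn C(11) PRIMARY KEY, name C(40), hired D)
* ssn alone identifies each row, so it is the key; ssn+name would be a superkey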

Primary keys and candidate keys are all keys. In the relational model, a non-unique key is not permitted, by definition. In Visual FoxPro, the primary key is our "primary" index, and candidate keys are likewise called candidate indexes.

This leads to an unsuspected side effect: as we just stated, in the relational model the fields that make up a key cannot repeat; in other words, no table can have two identical rows.

But we, good old VFP programmers, know that this is not exactly true. We develop using duplicated rows, and many tables in our systems have duplicated keys. What is going on? Let's be patient; we will see it later.

Relational Algebra

Why so much hype about the relational model? Because it was the first time that a data structure was described using a precise mathematical model. Because tables are "relations", the operations between tables form a "relational algebra".

Let's see: in ordinary algebra, there is an operation to SUM two numbers. Likewise, in relational algebra there is a UNION operation to "sum" two tables. Ordinary algebra operates on numbers; relational algebra operates on tables. Just as ordinary algebra returns results that are numbers, relational algebra returns results that are tables.

By 1973 the relational model was complete. Five basic operations were defined to operate on tables: selection, projection, renaming, join and union. Each operation acts on tables and necessarily returns other tables.

So that the relational algebra would have all its operations, the DIVISION operation was derived from the basic operators. Division is a little exotic because it is a sequence of several operators, and it was included to "close" the algebra (few real-world applications qualify for the division operator). The algebra was thus closed and complete: you can go back and forth using relational operators, as in ordinary algebra. For instance, just as multiplication has its counterpart in division in ordinary algebra, the Cartesian product has its division in relational algebra, and so on.

If we pay a little attention, there is no DELETE operator. In fact, deletion is not allowed in relational algebra. Neither is the SORT operator. What's happening?

Database Languages and practice vs. theory.

Relational algebra allowed D.D. Chamberlin (a researcher at IBM's San José labs, ca. 1973) and others to start developing the primitive SQL, the structured query language. For instance, the SELECTION operator gave rise to the WHERE clause, the PROJECTION operator led to the column list of SELECT-SQL, the JOIN operator led to the JOIN clause in all its flavours, UNION became the UNION clause between SELECTs, and so on.
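
To make the mapping concrete, here is a small hedged SELECT-SQL illustration (the EMPLOYEES and RETIREES tables are hypothetical): the column list is the projection, the WHERE clause is the selection, and UNION stacks the two result tables.

SELECT name, dept FROM employees WHERE dept = "SALES" ;
UNION ;
SELECT name, dept FROM retirees WHERE dept = "SALES"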

For each relational operator, IBM researchers created specific commands to implement it; there was a one-to-one equivalence between the table algebra and SQL. From the hierarchy and ancestry point of view, IBM is the mother of all databases. For instance, it was the first to solve recursive queries in real-world implementations (the WITH RECURSIVE clause can be found on some DB/2 platforms), at a time when SQL 3 (1999) was just a dream. Oracle would become a protagonist only several years later.

A few years later, some of the researchers left IBM and founded their own enterprise, The Relational Institute, which survives to this day, still researching the mathematical roots of the relational model at a very high level. The discussion seems endless: with a certain degree of fanaticism, the model is "attacked" and "defended" by university communities, independent researchers and private firms. But, paraphrasing Mr. Date: "The relational model has 30 years of uninterrupted success. This is the proof of its consistency and robustness." Actually, there is nothing that could make us doubt the truth of his words.

There are only two database languages in "pure" form: SQL and QBE (Query By Example), created by M. Zloof in 1975. Other languages fail to implement the relational operators completely, believe it or not. QBE is a language that uses a grid to build queries; it is not declarative - the user fills the grid with selection and filtering "criteria", and the system then executes the query. Years ago it had some followers, but SQL has since outpaced QBE by a wide margin.

SQL became the foundation for several dialects. There are three ANSI standards that define the language scope. Each standard is referred to by the year of its promulgation: level 1 for 1989 (SQL 89), level 2 for 1992 (SQL 92) and level 3 for 1999 (SQL 99).

Each software manufacturer releases its SQL version with a given degree of compliance with the standard. Thus vendor XXX may offer 80% SQL 2 compliance and vendor YYY 90%. This does not mean that YYY's product is better than XXX's; it simply means they are different. Ideally it would be nice to get - at least from some vendors - 100% adherence to the standard, but this has unavoidable drawbacks: the database manager would be enormous, and extremely expensive to build. Nowadays, none of the commercially available products offers such a degree of ANSI compliance. For instance, no one supports the DIVIDE operator; the INTERSECT operator is not even supported by Oracle. There are, instead, shortcuts to get the job done, consisting of combinations of basic commands. In fact, there is always a trade-off.

About 80% of database companies provide a reasonable SQL 2 compliance level; only a very few adhere to level 3. An exception is IBM's DB/2, which has a high degree of adherence to these standards - nearly 100%, as we could expect given its history in the DB arena. Also remarkable is the role played by PostgreSQL, an academically born database manager with a very interesting open source version for Linux under the GNU license, which is highly SQL 2 compliant and includes some features from SQL 3. Unfortunately, IBM's DB/2 is an RDBMS of deep technical complexity; its learning curve is frankly steep if we attempt to get it working at full power. In our tests, SQL Server turned out to be a light manager - fast, efficient and at the same time relatively easy to master. Some folks prefer the technical depth of DB/2, and some prefer the friendlier, more intuitive SQL Server interfaces and consoles. We are not issuing a comparison of speed or other factors here; we simply note the different idiosyncrasies of two major contenders in the DB arena such as MS and IBM.

Let's see an example of the difference between theory and practice: the relational model states that a given table cannot have two or more exactly equal rows, because that would prevent the creation of the key. And remember, a relation cannot exist without a key. But we know from our real-world, daily experience that this constraint is not always applied. In fact, SQL Server lets us load duplicate data into a table (provided no primary or unique keys were defined). So does VFP. What happens? It seems there is an incomplete degree of agreement with the theoretical premises here too. In databases, practice always keeps a healthy distance from theory; this has been so since the very first days of the ancient database managers, and for very understandable reasons, as we will soon see.

Another example: the result of a relational algebra operation must be another relation (which in turn must be fully model-compliant). Thus, the result of a selection must be a table without repeated rows. In the real world, this does not happen: the result of a SELECT-SQL statement can give us a cursor with repeated values in its key. To avoid duplicated values, we must issue the DISTINCT clause as a directive that explicitly instructs the DB manager to enforce the uniqueness constraint and match the model.
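
For instance (assuming a hypothetical MOVEMENTS table that records an article code per row):

SELECT codart FROM movements            && may return the same codart many times
SELECT DISTINCT codart FROM movements   && duplicates removed, as the model expects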

Based on these simple examples, we can infer that commercial database managers, in their default behaviour, fail to fulfill the relational model and algebra. They implement instead some mechanisms, using additional keywords, to enforce the constraints - if the programmer so desires. This is also a fundamental concept.

The optimizer is the layer of the database manager that reduces to a minimum the data transfer between one storage hierarchy layer and another. In practice, it minimizes access to the disk subsystem. Several mechanisms exist to achieve this; they contribute to getting the smallest possible transfer time between hierarchy layer 2 (disk) and hierarchy layer 1 (RAM).

Why don't we optimize RAM access? Because the access time of the disk subsystem is several orders of magnitude higher than that of RAM. Taking average times, a typical SDRAM access time is about 10 ns (10^-9 s), while for a disk drive it is about 5 ms (5x10^-3 s). Note the difference of roughly six orders of magnitude between them. Because of this, all programming resources are focused on the disk subsystem, because that is where most of the time is spent.

When a database manager accesses the disk, it performs a set of serialized operations which make up an "access method". The optimizer simply modifies this access method with only one goal in mind: to minimize the time spent in its operations.

Indexes: what they are and what they are not

We arrive at the point of these questions: what are indexes? Are they necessary?

  • The answer is surprisingly unexpected: NO. Every database system must be able to work without indexes.

Then the natural question is: what are indexes intended for? Are they intended to sort data?

  • The answer is even more unexpected: the main goal of indexes is not to sort data, but another important one.

Let's see, then, what exactly an index is, and what it is intended for:

  • An index is an external physical structure; it is not part of a table.
  • It is not part of the relational model, nor of the relational algebra (because of this, any database manager should be fully functional even without indexes).
  • Its main goal is to speed up data retrieval. The RDBMS uses indexes to alter its access method.
  • The fact that indexes sort data should merely be seen as a side effect.
  • Indexes are tables; because of this, indexes are relations too.
  • Indexes have two columns. The first one stores the values of the key, which in this case is called the "search key". Search keys generally match the table key. For instance, if we have an index expression like art_code+store_location, the first column stores the value of that expression. The second column stores the so-called RID (record identifier). The RID is a pair of integer numbers: the first one points to the page inside the data file, the second one points to a specific record inside that page. More complicated file organizations exist, but all of them derive from this concept of <page,RID>, as sketched below.
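
Conceptually, such an index looks like this (the values are invented for illustration):

    search key (art_code+store_location)      RID <page, record>
    "A001" + "NY"                             <17, 3>
    "A001" + "SF"                             < 2, 8>
    "A002" + "NY"                             < 5, 1>

Each entry points to a data page and to a record slot inside that page.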

Index types in RDBMS

The following concepts apply to relational database managers in general, but they are mainly used in client-server databases. In desktop RDBMS only an approximation of these structures is used.

All indexes serve one goal: to speed up data transfer. How do they achieve this? The same way a book index does. If we want to read a section of a given chapter of a book, we first take a look at the index or table of contents. If the book lacks such a facility, our only remedy is to examine the book page by page - that is, scan it - until we find the paragraph we are looking for. With tables, a similar approach is taken. If the table lacks an index, the RDBMS must scan the entire table, traversing it from beginning to end, page by page.

  • Clustered indexes. Clustered indexes have a distinctive characteristic: they force a physical sorting of the underlying table. They make the physical position of the rows match the index's internal ordering. In other words, if we have a clustered index on the NAME column of a given table, the table itself becomes ordered by that column.

  • Regular indexes (always speaking of client-server databases): they are the normal, non-clustered indexes. This may seem a very basic classification, but it satisfies our purposes here.

We stated that ordering - sorting - is only a side effect of the index's existence. The real reason is as follows: to locate a given value inside an index file, we certainly need an algorithm. The most widely used one is the "binary search", and it works as follows:

  • Go to the middle of the file, and check whether the search value (the value being looked for) is less than the search key stored in the row right in the middle of the file.

  • If the search value is less than the search key in the file, then we discard the lower half of the file, because we know the desired data will not be located there, but in the upper half.

  • If the search value is greater than the search key, then we discard the upper half of the file, because we know the desired data will not be located there, but in the lower half.

  • Then go to the middle of the remaining part of the file, and we repeat the same question and process until we have a match.

Figure 1: Single binary search

Binary search only works if the table is sorted; if the table is not sorted, binary search is simply not applicable. Because of this constraint, index files are always kept sorted. This is a requirement of the binary search algorithm, not the index's main goal.
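
Here is a minimal VFP sketch of the idea, searching a sorted one-dimensional array (the array name and values are hypothetical; a real index works on sorted pages, not on an array):

* binary search over the ascending sorted array laKeys, looking for lcValue
lnLow = 1
lnHigh = ALEN(laKeys)
lnFound = 0
DO WHILE lnLow <= lnHigh
	lnMid = INT((lnLow + lnHigh) / 2)
	DO CASE
		CASE laKeys[lnMid] == lcValue
			lnFound = lnMid			&& match found
			EXIT
		CASE lcValue < laKeys[lnMid]
			lnHigh = lnMid - 1		&& discard the half with greater keys
		OTHERWISE
			lnLow = lnMid + 1		&& discard the half with smaller keys
	ENDCASE
ENDDO
* lnFound now holds the element position, or 0 if the value is not present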

Once the RID is located in the index, the corresponding data in the table itself can be easily fetched.

The natural question would be: why don't we perform the binary search on the data table instead? If we did so, the index file would no longer be needed. Unfortunately, we usually need more than one index for a given table, so this approach is not valid: a given table cannot be physically ordered by more than one criterion. Thus we need external files, and so indexes are mainly external, physical structures.

Another fundamental concept: because the index file is smaller than the corresponding table file, it is always advisable to search first in the smaller structure, that is, the index. The number of bytes transferred from disk to RAM is smaller, and processing the index is quicker than the equivalent processing performed against the table.

Because a clustered index physically sorts the underlying table, only one clustered index can exist per table - obviously, a table cannot be physically ordered by more than one sorting criterion. The clustered index will therefore be the fastest, thanks to the inherent physical sorting; other indexes will perform somewhat worse. Because of this, if we define a primary key in a client-server database such as SQL Server, the database manager automatically implements a clustered index to get this better performance.

The optimizer and the access plan

We stated that the optimizer minimizes time. In database slang, we could say that it minimizes the query "cost" in order to keep the time spent on the query at the smallest possible value. The optimizer uses a given method to access the disk subsystem; this method is called the "access plan". A given query can have a myriad of possible access plans, and the trick is to find the best of them all, that is, the one with the least time involved.

What is the query cost measured in? Initially, it was measured in processor ticks: the more ticks a query needed to complete the task, the more expensive the query. Nowadays, other measurement units are defined. The most modern of them involve parameters like average disk seek time, front-side bus speed, processor speed and so on, in order to define an abstract measurement unit. For instance, IBM's DB/2 defines its own cost with the "timeron" unit.

All this said, we must remember that more than 90% of the time spent on a query is consumed by the disk subsystem. Thus, if we were able to optimize only the global disk access time, we would really be optimizing the entire query process as well. This is the approach that several software manufacturers adopted; this is the main job of the optimizer.

We can classify several kinds of optimizers. The principal ones are:

  • Automatic or heuristic: they constantly try to optimize queries. They have access to a vendor-made internal knowledge base, jealously kept secret and programmed into the database engine, and to an artificial intelligence engine which is basically - believe it or not - a neural network. Let's suppose that a query has a hundred ways to be executed. Should the optimizer find the best one? If it did, it would surely spend more time finding the most perfect access plan than the time required to execute it. The cost of finding the perfect solution is excessive. Thus, the neural network finds a good-enough solution, a heuristic one, in a lapse that does not compromise the final time consumption. For instance, out of 100 possibilities, it will examine only 5 or 6. Perhaps the adopted solution is not perfect, but because the time spent making that decision is kept clearly under certain limits, the total time is effectively reduced. Heuristic engines work at run time. The most advanced ones perform a total rewrite of the original query, so in those cases we are never sure whether the executed query is the same as the query we wrote.

  • By syntax or code: this depends on the programmer. The queries must be properly written to let the database manager use the optimization. Obviously, it must be put into code, so it is design-time work. The proper syntax of the basic optimizable expressions that Rushmore uses is a clear example of this kind of approach.
  • By wizards or assistants: they optimize already-written code through a wizard or assistant, at design time. It is a variation of the second kind. Generally they are graphical interfaces that decompose the original query into its basic blocks or processes and show them as a sort of chart or flow diagram on screen, with basic geometric shapes indicating the execution of such basic processes. Each symbol carries a given cost, and a percentage of the total cost. The developer can then easily identify the query bottleneck and add indexes or alter syntax to reduce the total cost. Some wizards can give hints or recommended actions to follow to improve the response time.

  • Automatic optimizers exist in every client-server database, and the software manufacturer never reveals the deep architecture of such pieces; it is top secret. SQL Server's optimizer is certainly refined and powerful, and in fact many activities run in the background, transparently to the users. It also depends on statistics collected over time, which help the heuristic engine quickly find the less-than-perfect option.

    There exist other kinds of response-time minimizers, but they do not act on the cost of the query itself; instead they reassign system resources such as allocated memory and processor time slicing, through "parallelization" mechanisms. Even when the computer has only one processor, some database engines can execute some pieces of the query in parallel; this technique is known as "intra-query parallelism". One of the most remarkable examples is the engine found in Oracle databases, which accepts certain tags or hints interspersed in the query sentence. These tags act like preprocessor directives, and the compiler then decides which segments can properly be executed in parallel. In our tests, against a PERSONAL table with 1,980,000 records, without indexes and without parallelism (normal settings), Oracle 8.x took, say, 1 cost unit. With parallelism allowed, limited to 2 concurrent processes, the elapsed time was, say, 0.4 cost units - an effective time saving of about 60%. Microsoft will promptly reach this kind of sophistication; new versions of SQL Server promise to be the best database ever. And remember that with VFP, MS SQL Server has outstanding synergy and by far the best footprint - we tested it.

    Optimizations in VFP.

    In Visual FoxPro - as in all desktop databases - there is no concept of a clustered index. Generally, data files are sets of unsorted records (also known as heap files). We cannot assume that tables in VFP are ordered; in fact, DBF files are generally kept unsorted. In database servers such as MS SQL Server, IBM DB/2 or Oracle, if a table has a primary key, we should assume that there is a clustered index, and the underlying table is then physically ordered. This gives maximum speed on queries whose constraints involve primary key columns.

    However, we can emulate the behaviour of clustered indexes in VFP. We must run the SORT command periodically, with our sorting criteria matching the principal index expression. Once sorted, we must delete the old table, rename the sorted one and index it accordingly.
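
    A hedged sketch of that maintenance routine follows (table and tag names are hypothetical, and memo/index files are ignored for brevity):

    * rebuild ARTIC physically ordered by its principal index expression
    USE artic EXCLUSIVE
    SORT TO artic_srt ON codart
    USE                              && close the original table
    DELETE FILE artic.dbf
    RENAME artic_srt.dbf TO artic.dbf
    USE artic EXCLUSIVE
    INDEX ON codart TAG codart       && rebuild the structural index
    USE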

    In VFP we can find three kinds of optimizers: (1) automatic, (2) explicit (programmer's code) and (3) implicit, involving deletion marks. The utilization of regular indexes is performed automatically, provided the corresponding clause is properly written.

    Automatic optimization

    • VFP checks the existence of an index whose expression equals the left- or right-hand side expression of the WHERE clause. If so, it will use that index.
    • If the index is not available, VFP will estimate the time needed to build an index on the fly. If this estimated time is below a certain threshold relative to the plain query execution, it will build a temporary index and then execute the query using it. To take this decision, VFP must first measure the disk access time, the amount of available memory, the number of rows to be indexed, the complexity of the clause, etc. Temporary indexes are stored in the folder defined in CONFIG.FP/CONFIG.FPW (FoxPro for DOS and Windows) or in the folder returned by the SYS(2023) system function in VFP.
    • There is no chance for VFP to execute a heuristic optimization like those performed in server databases. Given the compact size of the VFP executable and auxiliary files, we can hardly expect such an approach. We believe that all effort on this front goes into the indexing technique, retrieval and evaluation, with a lightning-fast algorithm - the best in the industry. Of course, these are only guesses based on the little evidence that can be inferred from outside.
    • VFP does not feature a statistical mechanism, nor does it keep a statistical log of the executed queries. Thus, we can be sure it is not a heuristic engine based on such a knowledge base.
    • Of course, as we can easily understand, neither Fox Software nor Microsoft have released in-depth information on the underlying architecture of the FoxPro/VFP engine's decision algorithm.

    Explicit optimization

    Explicit optimization consists in writing a FOR or WHERE clause whose left-hand side (or right-hand side) equals the expression used to build the index. This is a developer's task: the programmer explicitly optimizes the query sentence or the optimizable xBASE commands. If certain conditions are respected, the optimizer automatically uses the corresponding index.

    For instance, if we have created an index with:

    INDEX ON codart+codsuc TAG i1
    

    the correct query syntax will be:

    SELECT * FROM artic WHERE codart+codsuc=lcArtic
    

    in order to optimize it correctly. The wrong syntax will be, for instance:

    SELECT * FROM artic WHERE codart=lcArtic
    

    in such a case, VFP will not find an explicitly optimized expression, and it will then try to impose an automatic optimization. Please note that in order for the following code to be successful...

    SELECT * FROM artic WHERE codart+codsuc=lcArtic
    

    ... the ANSI comparison setting (SET ANSI) must be kept OFF. Otherwise, if this setting is ON, both sides of the equal sign must have the same length (assuming string comparisons), and in general the comparison will always return FALSE, because lcArtic is a memory variable holding only part of the search key, which in turn is composed of two fields.
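
    A quick way to see the effect (lcArtic holds only the article part of the compound key; names are from the example above):

    lcArtic = "A001"
    SET ANSI OFF
    SELECT * FROM artic WHERE codart + codsuc = lcArtic   && partial-key match, rows are returned
    SET ANSI ON
    SELECT * FROM artic WHERE codart + codsuc = lcArtic   && lengths differ, nothing matches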

    Implicit optimization.

    Implicit optimization acts on the deletion mark. Every table that participates in a query - or in an xBASE command that accepts a FOR clause - must be indexed on the deletion mark. For example, an index such as

    INDEX ON DELETED() TAG deltd
    

    must effectively exist on each table of our application. If so, the VFP engine uses it to avoid evaluating records marked for deletion, provided SET DELETED is set to ON.
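
    A hedged sketch of how we might add that tag to every table of the currently open database (it assumes exclusive access; the tag name "deltd" is arbitrary):

    LOCAL laTables[1], lnCount, lnI
    lnCount = ADBOBJECTS(laTables, "TABLE")   && tables of the current database
    FOR lnI = 1 TO lnCount
    	USE (laTables[lnI]) EXCLUSIVE
    	INDEX ON DELETED() TAG deltd
    	USE
    ENDFOR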

    The reason is that the VFP Rushmore engine automatically adds an invisible clause if the SET DELETED setting is ON:

    ..AND NOT DELETED()

    Thus, if this index is available, VFP will use it and the query will be totally optimized. Implicit optimization is always attempted, but the query will not be 100% optimizable unless an index on DELETED() is available. Please note that in database servers the concept of a deletion mark is not present, because in relational algebra there is no such possibility: a set either has an element, or it has not.

    When can index optimization not be used?

    If VFP determines that optimization cannot be used, the only remedy is to scan all the file pages. Unfortunately, this is a very slow process, because it must load all the pages into RAM, and that is a very time-consuming operation.

    There are mechanisms to optimize page access even when no index is available. The preferred one is the hashed file structure. The database engine features a hash function (a "dispersion" function in some literature), represented for our purposes as H(). The argument of H() is the data to be located, and the function returns an integer pointing to the possible location of the desired data.

    For instance, H("charles") can return 1632. This number represents a set of pages called a "bucket". If pages 200 to 203 have the bucket number 1632, the database engine does not search the entire file, but only those pages. When the RDBMS creates the file, it associates the pages with initial bucket numbers.
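
    Just to illustrate the idea, here is a toy hash function in VFP (this is not how any real engine computes its hash): it folds the character codes of the key into one of tnBuckets buckets.

    FUNCTION BucketOf
    	LPARAMETERS tcKey, tnBuckets
    	LOCAL lnSum, lnI
    	lnSum = 0
    	FOR lnI = 1 TO LEN(tcKey)
    		lnSum = lnSum + ASC(SUBSTR(tcKey, lnI, 1))
    	ENDFOR
    	RETURN MOD(lnSum, tnBuckets) + 1
    ENDFUNC
    * ? BucketOf("charles", 2048) returns the bucket whose pages would be searched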

    Figure 2: Hashed File

    Unfortunately, I don't believe that VFP uses hash functions to access DBF tables - MS never stated anything regarding this issue - because the hash method requires additional storage space, space that DBF files do not provide. For instance, if a normal table is 100 MB in size, the same data in a hashed file will take 125 MB, because 25% is always maintained as "reserve buckets". The database manager reserves this space at the very moment of table creation. Because VFP uses regular DBF files and does not utilize a centralized container - it uses separate files that are logically grouped when a DBC is present - the hashed file is not applicable. A VFP database is a logical container, not a physical one (this means that VFP databases only exist as a unit inside VFP itself). A database server, on the other hand, normally has a "tablespace", which is a file, folder or disk partition. Inside the tablespace the data is arranged in an "all the tables together" fashion. The tablespace can span several files, directories or disks. With this approach it is relatively easy to implement hashed files, because the extra space is stored inside the tablespace too and is totally under the control of the RDBMS.

    Some database server engines apply the same hashed-file concept to the indexes themselves (after all, index files are also tables), giving even more performance to normal indexes, and they use a search algorithm that applies only to a subset of pages.

    VFP, instead, uses another approach. For compatibility reasons, data storage must be materialized using DBF files, which by definition are single, separate files. This eliminates the chance to use a hashed file structure, and leaves us with the option of using indexes in the most efficient way when query optimization needs to be performed.

    REMARK: don't confuse this with hashed joins, a mechanism that allows database engines to perform JOIN operations very efficiently. That kind of JOIN uses a "hashed" table in memory. In the paragraph above we were referring to hashed file structures stored on disk, not to the hashed JOIN algorithm. The explanation of the hashed join goes beyond the scope of this document, and perhaps we can cover it in upcoming notes. In fact, we think that a hashed JOIN might be performed by VFP under certain conditions - this is, again, a best guess.

    Thus, if no indexes are available, VFP cannot execute the optimization it is used to. At first glance, this may seem discouraging for us developers. But let's see things this way: VFP is a collection of C++ data access functions (possibly with some segments in assembler). Few of us - if anyone - could write such a library with better performance. We can assume then, without any doubt, that VFP's use of indexes is the best possible. If we keep this in mind, the role left to the programmer is to carefully design the index set to be used, based on the queries the application is expected to run.

    SYS(3054) and the access plan

    The access plan in FoxPro, which in our case is a synonym for how optimization is done, can be seen by using the SYS(3054) system function. Let's see a simple query example:

    SELECT * FROM solic WHERE nrorec="00001452"
    

    where the query is executed against the SOLIC.DBF table with no index available. We will examine the distinct options:

    Let's first activate the access plan display. SYS(3054,x) takes an argument x, which can have these values:

    • 0 = optimization level is not shown (default setting)
    • 1 = only the optimization corresponding to FOR and WHERE clauses that do not include table joins is shown, that is, filtering operations
    • 2 = both filtering and join optimization are shown.

    In relational algebra there is the selection operator, which indicates the operation of row or record selection (the Greek letter sigma, our "s", applies to table filtering - selection - processes). For instance, to indicate the selection of the single row of the SOLIC.DBF table whose receipt number equals "00001452", in relational algebra we would write the operation as σ nrorec="00001452" (solic).

    The equivalent SQL sentence will be:

    SELECT * FROM solic WHERE nrorec="00001452"
    

    where the selection can be seen as a filter imposed on the SOLIC table, yielding another table with the filtered data. To find out whether the index is used or bypassed, we must use the SYS(3054) function. Let's issue in the command window

    =SYS(3054,2)
    

    and watch what's happening when executing the query

    SELECT * FROM solic WHERE nrorec="00001452".

    Because no indexes are available, VFP cannot impose any useful optimization. On the VFP desktop we can read:

    Rushmore optimization level for table Solic: nothing.
    

    Now let's create an index on column nrorec by issuing

    INDEX ON nrorec TAG nrorec

    The index can be of any kind, not necessarily compact structural CDX. Once indexed, let's reissue the query:

    SELECT * FROM solic WHERE nrorec="00001452"
    

    now we can check the following message on the desktop:

    Using index with tag Nrorec to optimize with Rushmore table solic
    Rushmore optimization level for table solic: partial
    

    As we could expect, the index is used to optimize the query. The optimization is not complete because the filter is imposed using both the index and the deletion mark, which is evaluated whenever SET DELETED is ON: VFP internally adds an AND NOT DELETED() clause to the query. Because no index on the deletion mark is available, the optimization is reported as "partial".

    Let's verify this. Issue SET DELETED OFF and re-issue the same query. Now the echoed message reads:

    Using tag Nrorec to optimize with Rushmore table Solic
    
    Rushmore optimization level for table solic: complete
    • Now let's SET DELETED ON again - as it usually is in our systems; think of how useless a program would be if it accepted deleted data along with valid data. With this setting ON, let's now index on the deletion mark with this sentence:

    INDEX ON DELETED() TAG Deltd
    

    Even though the tag name could be "deleted", as in this example:
    INDEX ON DELETED() TAG deleted
    

    some FoxPro professionals, such as Pat Adams - one of the FoxPro 2.x gurus - strongly discouraged readers from using this approach. She advised using a similar tag such as "deltd" instead. In those days, dBASE IV was not as indulgent as FoxPro regarding the use of reserved words.

    Once the index on the deletion mark is in place, let's reissue the query. The echoed message - with SET DELETED ON - is now the same one we had with SET DELETED OFF and no index on the deletion mark:

    Using tag Nrorec to optimize with Rushmore table Solic
    Rushmore optimization level for table solic: complete
    

    We have a complete optimization, because both indexes are used to perform the enhancement. In this case, the WHERE clause is internally considered as:

    SELECT * FROM solic WHERE nrorec="00001452"  AND NOT DELETED()

    where

    nrorec="00001452"

    is the programmer's condition, and

    AND NOT DELETED()

    is the condition VFP implicitly adds, in an automatic fashion, because SET DELETED was ON.

    When there is an AND operator, the optimizer interrupts the evaluation when the first operand returns FALSE, that is, when nrorec is not equal to "00001452". The row is discarded before the deletion mark is even evaluated.

    We could then think about the convenience of placing the NOT DELETED() clause first and nrorec="00001452" second, so that the deletion mark is evaluated first. Supposing that VFP effectively respects the operators' order, this approach should be discouraged, because there are fewer records matching the condition nrorec="00001452" than records matching DELETED()=.T. From this we can infer that it is always advisable to put the most restrictive condition at the beginning of compound AND expressions, in order to minimize the time spent evaluating the logical condition. Supposing that VFP respects the written order, the responsibility for optimizing with this technique lies on the programmer's shoulders. In server databases, however, this rearrangement is the role of the automatic optimizer: by using statistics and examining the characteristics of the involved tables, it knows which condition will turn out to be the most restrictive. For instance, an advanced database manager will put first a condition involving a primary key - there will be, at most, only one match.

    When a SELECT-SQL query performs a selection - filtering - operation, VFP usually performs this sequence:

    a. The original table is opened again, in a work area other than the currently selected one. It is not possible to accurately predict in which area the table will be opened; that depends on several factors, and we must assume random behaviour on this point. We can think of it as if VFP executed the equivalent of USE...AGAIN NOUPDATE in a randomly selected work area.

    b. A filter is imposed on the newly opened table by using the index. If the index is not available, VFP decides whether or not to create a temporary index, based on quick measurements of parameters like available memory, processor speed, etc.

    c. As a result, we obtain a read-only cursor. This cursor is associated with the underlying table. To avoid changes to the data residing in the underlying table, the cursor is marked read-only (see the NOUPDATE clause).

    This operation is very fast, because there is no significant information transfer between the original source table and the resulting cursor - both point to the same physical table. VFP simply opens the table in another work area, filters it, and marks it read-only. The filtering operation against an indexed table is very quick. If you do not want to use this filtering mechanism, you must add the NOFILTER clause to the SELECT-SQL sentence. This effectively avoids the filter and creates a new cursor, separate from the original data, residing in the folder indicated by the system function SYS(2023). This cursor, for the sake of homogeneity, is still read-only. However, it is very probable that VFP never writes this cursor to disk, keeping it instead in the disk cache administered by VFP. In our tests, with cursor sizes of about one quarter of the physically installed RAM, the cursor was kept in memory, never on disk. The DBF() built-in function, when executed in the cursor's work area, returns a fully qualified file name that does not actually exist on disk; if you execute FILE(DBF()) it simply returns .F. For this reason, when we close the cursor, there are no traces of it left in the SYS(2023) folder: simply put, the file was never physically written to disk.
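
    A quick experiment makes the difference visible (the cursor names are arbitrary):

    SELECT * FROM artic WHERE codrub + codart = "01" INTO CURSOR qry1
    ? DBF("qry1"), FILE(DBF("qry1"))     && points back to ARTIC.DBF, so .T.
    SELECT * FROM artic WHERE codrub + codart = "01" INTO CURSOR qry2 NOFILTER
    ? DBF("qry2"), FILE(DBF("qry2"))     && temporary name under SYS(2023), usually .F.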

    In FoxPro 2.x versions, where the NOFILTER clause is not available, we must add a calculated field or virtual column to the query's selection list. This avoids the filtering mechanism too. Let's suppose that we want to execute this query:

    SELECT * FROM artic WHERE codrub+codart="01"
    

    as we have seen, this will be a filtered query using USE...AGAIN NOUPDATE.

    If we want to avoid this approach, we can modify the query as follows:

    SELECT *,.F. AS otherfield FROM artic WHERE codrub+codart="01"
    

    with this, FoxPro will create another cursor in the SORTWORK directory, or in the TMPFILES one. If those environment variables do not exist, FoxPro will use the same directory where the application file is located, which is not very advisable, because periodic manual deletion of temporary files becomes mandatory to avoid directory overpopulation.

    Table JOIN (combination) and its optimization

    Relational algebra has another basic operation: the table JOIN. The JOIN operation needs a JOIN condition. For instance, let's take the EMPLOYEE and DEPARTMENT tables (the good old example). In order to perform the JOIN operation, we need at least one common field in both tables. Let's suppose this column is called empID (employee identifier), and it exists in both tables.

    The JOIN operation between EMPLOYEE (E) and DEPARTMENT (D) on the common field empID is written in SQL as follows:

    SELECT E.*,D.* FROM empleados E INNER JOIN departamentos D ON E.empID=D.empID
    

    which is the explicit syntax for a table join. We can also use the implicit join, which occurs whenever the JOIN condition is specified in the WHERE clause:

    SELECT E.*,D.* FROM empleados E,departamentos D WHERE E.empID=D.empID

    Relational algebra does not impose which comparison expression must be used in the JOIN condition. We can even use an operator other than the equal sign; that is, non-equi-join operations are permitted. For instance:
    SELECT E.*,D.* FROM empleados E INNER JOIN departamentos D ON E.empID<D.empID
    

    of questionable usefulness, but permitted, indeed.

    If the E and D tables have only one common field, such as empID, we can write the operation without specifying the JOIN condition; it is assumed implicitly. In such a case, the corresponding algebra notation is simply the natural join of E and D, written E ⋈ D, with no condition attached.

    If there is more than one common field, and the JOIN operation includes them all, then we have the natural JOIN, where all the common fields must be equal in order to match the corresponding rows from both sides. In our example, empID is the only common field. All natural JOINs are based on equalities; no operator other than the equal sign is permitted here.

    How is a JOIN operation executed?

    Let's suppose we have an ARTICLE (A) table, with 1,000 records; and a table TYPE (R), with 20 records. The WHERE clause would be:

    SELECT * FROM article A, type R WHERE a.codrub=r.codrub
    

    where codrub is the type code. The sets of operations performed to materialize the relational operators are called "relational operator implementations", and the possible methods are:

    1. Record-oriented table JOIN (no indexes) (very inefficient)

    For each row of one table, all the records of the other table are scanned, and on each step the JOIN condition is evaluated. We can describe it as follows:

    • For each record in table A, do:

      • For each record in table R, do:

        • if A.codrub = R.codrub, then add the pair (A,R) to the result set.

    In other words: for each row of the external table A, we scan the internal table R entirely. The query cost will be

    Cost=1000+1000*20= 21,000 I/O operations
    

    2. Page-oriented table JOIN (no indexes)

    File paging is an effective way to reduce the number of I/O operations: we spend 1 I/O operation to retrieve one page, whereas in the first approach we spend 1 I/O operation for each record retrieved. Let's suppose each page holds 20 records. Then the ARTICLE table will have 500 pages, and the TYPE table only 1 page.

    • For each page of table A, do:

      • For each page of table R, do:

        • For each record in the page of A, retrieve all the matching records of the page of R based on the JOIN condition A.codrub = R.codrub, and pass these records to the result set.

    The query cost will be:

    Cost=500+500*1 = 1,000 I/O operations
    
    Comparing this result with the first 21,000 I/O operations, we clearly achieve a substantial improvement. The position in the WHERE clause is, at first glance, not commutative: WHERE a.codrub=r.codrub is not equal, in terms of performance, to WHERE r.codrub=a.codrub. Let's see why, assuming the "external" table is the one on the left-hand side:
    • With ARTICLE as the external table and TYPE as the internal one, the cost will be about 500+500*1 = 1,000 I/O operations

    • With TYPE as the external table and ARTICLE as the internal one, the cost will be about 1+1*500 = 501 I/O operations

    We can easily deduce that we should place as the external table the one with the smaller number of records or pages, because all pages have the same size for a given table.

    We might think we must pay attention to this issue in VFP, but don't worry: Rushmore will place as the external table the one with fewer pages, and it will do it for free, for us. The optimizer can find it by using the RECCOUNT() built-in function, because it returns a measure that is directly proportional to the number of pages; it only has to find which table has the smallest RECCOUNT() and place it as the "external" one.

    In client-server databases the optimizer also finds out which table has the smallest page count, regardless of how the query was written. However, the information evaluated by this type of RDBMS is certainly greater than that evaluated by a desktop database such as VFP: client-server databases keep statistical logs, store access plan patterns in the database, and so on.

    We may ask ourselves where the saving is in reading pages instead of records. Because the number of transferred bytes is practically the same in both cases, at first sight we can hardly perceive the difference. The answer is that the cost of reading a page is roughly equivalent to the cost of reading a single record: once the disk channel is open, it pays to keep it open so the setup time is spent only once. In other words, it is preferable to open the disk subsystem 1,000 times to read 1,000 pages than to open it 21,000 times to read 21,000 records - even when a disk cache is available, because that facility benefits both options equally.

    We must distinguish between database table pages - managed by the RDBMS - and filesystem pages - managed by the operating system. There is no reason to force them to match in size. In these lines, we are referring to the former kind - table pages.

    Remarks on prefetching

    In addition, VFP, like other database managers, tends to read more pages than are effectively needed. This is because if VFP needs additional data, it is very probable that the data will be located in subsequent pages - in the neighborhood. This mechanism is called "read-ahead" or "prefetching". Some advanced RDBMS allow the user to change the parameters of this mechanism and configure the optimal prefetching size using wizards.

    VFP does not allow the user to change this behaviour, nor does SQL Server. Microsoft's philosophy - explicitly declared by the company - is to let the RDBMS set these parameters internally, without human intervention. The parameters can be "factory preset" or calculated at run time depending on the machine the software is running on, as SQL Server does (a "nobody knows our software better than us" approach).

    SQL Server handles the prefetching mechanism automatically, and thanks to the excellent synergy with the operating system, the decisions SQL Server makes are ensured to be close to optimal. In this sense, the Redmond engineers have done excellent work, because in our tests SQL Server 2000 performance was clearly world-class.

    As we can infer, prefetching also reduces time consumption substantially if the table is sorted - or at least most of it. Thus, it is advisable to keep the table physically sorted with the SORT command on a periodic basis. The SORT operation can even be done with the original table in use by others; only when we rename the sorted file must the original one be closed, in order to be removed and replaced by the sorted one. We must then rebuild the corresponding indexes. Because of this, a daily basis is not recommended; a weekly or even monthly one is more advisable - of course, if the transaction load is high, you should shorten the lapse between physical sorts.

    Disk types and prefetching.

    IDE disk drives (Integrated Drive Electronics) are not suitable for multiuser environments, even if the server is full of memory. This is because the IDE standard does not resolve simultaneous multiuser read/write requests; it simply processes them sequentially, creating a bottleneck or "multiuser contention".

    SCSI disks, instead (Small Computer Systems Interface), are conceived with multiuser requests in mind. They can resolve multiple read/write requests through an intelligent queueing mechanism at the controller level. SCSI hardware is intelligent enough to optimize disk access in multiuser scenarios. Choosing a SCSI disk and controller is practically mandatory in multiuser, networked database environments.

    Despite the fact that the IDE data transfer rate is similar to the SCSI one, the hardware resolution of the request queue present in SCSI devices makes them much more efficient and responsive than their IDE cousins. As a rule of thumb, if we can afford it, we should have a server with SCSI disks.

    Prefetching minimizes the read issue with IDE disks, because it effectively reduces the number of disk accesses. Don't confuse prefetching with database buffering; the latter mechanism has another goal.

    3. Nested indexed loops JOIN (based on indexed records)

    This is a mechanism derived from option 1, and it basically consists in taking as the internal table the one with an index on the join expression. Supposing that the TYPE (R) table has an index on the codrub column, the operation would be (assuming ARTICLE is the external table):

    • For each record in table A, do:

      • seek the value A.codrub using the index on table R.

        • if that value is found, then pass (A,R) records to the result set.

    Please note that the existence of an index on table A is irrelevant, because only the internal table's index is used. This is - basically - the kind of optimization that VFP uses in JOIN operations, as we can see by using the system function SYS(3054,2).
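
    In VFP-flavoured pseudo-code (illustrative only - Rushmore does the equivalent work internally, and we assume R has a tag named codrub), the nested indexed loops join looks like this:

    SELECT A
    SCAN
    	IF SEEK(A.codrub, "R", "codrub")
    		* pass the pair (A, R) to the result set
    	ENDIF
    ENDSCAN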

    The cost will be: Cost = 500 + (500*20) * (cost to find the matching rows in TYPE through the index)

    For each row in ARTICLE, the cost to find an entry in the TYPE index would be:

    • For clustered indexes: about 1.2 I/O (typical). VFP does not use clustered indexes, so...

    • ...for B-tree indexes it will be from 2 to 4 I/O. VFP uses a variant of the B-tree, called B*-tree or modified B-tree, so its performance will be closer to these values than to the previous one.

    Once the entry in the TYPE index has been found, the cost to fetch the data in the table row itself will be:

    • For a clustered index, 1 I/O operation is typically consumed for the complete query.

    • For normal indexes, up to 1 I/O operation can be consumed for each found row (worst-case scenario).

    Please note the big difference between a clustered index and a normal one. To clarify further, let's run some numbers:

    • Clustered index, ARTICLE is external and TYPE is internal:

      • SCAN on ARTICLE: 500 I/O operations (500 pages in A)

      • For each row in ARTICLE: 1.2 I/O to find the index page where the desired data resides, plus 1 I/O to retrieve the matching row in TYPE.

      • Total cost is then 500 * (1.2+1) = 1,100 I/O operations

    • Clustered index, TYPE is the external table and ARTICLE is internal:

      • SCAN on TYPE: 1 I/O operation (1 page)

      • For each row in TYPE: 1.2 I/O to find the index page where the desired data resides, plus the cost to retrieve the matching rows in ARTICLE. Assuming a uniform distribution of 500 articles per type, the cost to retrieve them will be 1 or up to 500 I/O depending on the index (1 for clustered indexes, up to 500 for regular ones).

      • Total cost is then: (a) best case, with a clustered index on ARTICLE: 1 * (1.2+1) = 2.2 I/O; (b) worst case, with a regular index: 1 * (1.2+500) = 501.2 I/O.

    As we can see, it is convenient to use the smaller table as the external one.

    4. Sort-Merge Join

    The trick here consists in sorting ARTICLE and TYPE by the common field, say CODRUB. Then we scan both tables in order to combine them on the CODRUB field and retrieve the matching rows. We can easily express this concept with a little pseudo-code in VFP. Study it carefully in order to understand the steps taken by the Sort-Merge Join mechanism:

    * SORT-MERGE JOIN pseudocode in VFP. This is not actually valid code, but an
    * illustration of the algorithm executed by most RDBMS.
    * We assume both ARTICLE (A) and TYPE (R) are sorted on the common field CODRUB.
    * First, go to the first record of both tables
    GO TOP IN R
    GO TOP IN A
    * then start the process
    DO WHILE NOT EOF("R") AND NOT EOF("A")
    	DO CASE
    		CASE R.codrub = A.codrub	&& if there is a match, call the Cartesian procedure
    			DO Cartesian	&& see the Cartesian procedure below
    		CASE R.codrub > A.codrub	&& if left > right, skip in the right table
    			SKIP IN A
    		CASE R.codrub < A.codrub	&& if left < right, skip in the left table
    			SKIP IN R
    	ENDCASE
    ENDDO
    * at this point the result set should be filled and completed

    *******************
    PROCEDURE Cartesian
    *******************
    lVal = R.codrub		&& take the outer table key value
    DO WHILE R.codrub = lVal
    	* code here to store the R row in a temporary cursor RT
    	SKIP IN R
    ENDDO
    DO WHILE A.codrub = lVal
    	* code here to store the A row in a temporary cursor AT
    	SKIP IN A
    ENDDO
    * code here to output the Cartesian product of the RT and AT temporary cursors
    RETURN

    R is scanned once; each group of A is scanned once for each matching row in R. A buffer helps when multiple scans of A are executed, and this facility can be very handy. Please remember that the code above is for illustrative purposes only: VFP does not actually execute such code, but a series of equivalent built-in C++ procedures.

    In spite of the fact that this concept may seem a little complicated, this technique is used by all known RDBMSs. Because sorted tables are needed for it, and VFP only sorts physically when the SORT command is used, indexes are needed to perform this kind of operation; VFP probably uses the indexes to "see" the tables as if they were sorted. Database servers can only take the same approach when a clustered index is available; if not, they use regular indexes to perform a high-speed sort on the fly when a Sort-Merge join is requested. In that case, the time consumed by the sort algorithm must be added to the cost of the whole Sort-Merge join operation.
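    For reference, a minimal sketch of the indexes that would let the engine traverse both tables in CODRUB order without a physical sort (the tag names are our choice):

    * create a tag on the common field in each table, inside the structural CDX
    SELECT 0
    USE article.dbf EXCLUSIVE
    INDEX ON codrub TAG codrub
    SELECT 0
    USE type.dbf EXCLUSIVE
    INDEX ON codrub TAG codrub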

    In the following figure we can see the sort-merge algorithm, where matching rows share the same background color:

    • Note that for TYPE code 22 there are no matching rows in ARTICLE, so no output is produced for it.
    • At codrub=28 there are two matching rows in ARTICLE; the engine outputs record #2 from TYPE with records #2 and #3 from ARTICLE.
    • At codrub=31 there are three matching rows in ARTICLE; the engine outputs record #3 from TYPE with records #3, #4 and #5 from ARTICLE.
    • And so on. Please note that the TYPE table gets only one complete scan, while the ARTICLE table is scanned several times, once per matching TYPE row.

    M = number of pages of the external table TYPE = 1 page (20 records per page)

    N = number of pages of the internal table ARTICLE = 500 pages (20 records per page, 10,000 records in ARTICLE)

    The cost for a sort-merge join is M * log2(M) + N * log2(N) + (M + N)

    Leaving the heavy math aside, the base-2 logarithm appears because of the binary searches involved. Basically, the cost would be:

    Cost = 1 * log2(1) + 500 * log2(500) + (1 + 500) ≈ 0 + 4,483 + 501 ≈ 4,984 I/O ops.

    Joins in VFP: meeting Rushmore

    Let's now look at a JOIN operation between two tables:

    • SOLIC.DBF is a 600,000-record table. It contains medical practices, and we have borrowed it from a real-world system. The table does not fit in memory, which minimizes disk-cache side effects.
    • PROFESIO.DBF is a small 42-record table; it contains physicians' (professionals') data.
    • The common field is REGNO (registration number), six characters wide.

    The result is very interesting: it seems that VFP effectively chooses the smaller table as the external one, just as its big sisters do.

    The test query is:

    SELECT a.regNO,b.prof_name FROM solic a,profesio b WHERE a.regNO=b.regNO

    In VFP, our possibilities are: (a) with or without indexes, (b) with FROM solic,profesio or with FROM profesio,solic, and (c) with a.regNO on the left-hand or right-hand side of the WHERE condition. We averaged 10 passes for each combination:

    Physical sorting | Indexes? | FROM | WHERE | Time (s)
    None | None | solic a, profesio b | a.regNO=b.regNO | 74.84
    None | None | profesio b, solic a | a.regNO=b.regNO | 72.24
    None | None | solic a, profesio b | b.regNO=a.regNO | 72.73
    None | None | profesio b, solic a | b.regNO=a.regNO | 72.81

    Please note that, regardless of the query syntax, the elapsed time is essentially the same. This means that it does not matter how the programmer writes his/her clauses: the optimizer decomposes the sentence into basic blocks and rearranges them. The external table should be PROFESIO.DBF, because it is much smaller than SOLIC.DBF. In this case, the plan was visible (SYS(3054) was set accordingly) and reported that no optimization was done.
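    For reference, a minimal sketch of how the plan can be made visible (the cursor name is our choice). SYS(3054, 1) reports filter (Rushmore) optimization; in VFP 7 and later, SYS(3054, 11) also reports join optimization:

    * show the Rushmore/join optimization report for the following query
    =SYS(3054, 11)
    SELECT a.regNO, b.prof_name ;
    	FROM solic a, profesio b ;
    	WHERE a.regNO = b.regNO ;
    	INTO CURSOR crsResult
    =SYS(3054, 0)	&& turn the plan display off again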

    Let's see now the impact of indexes. We indexed SOLIC on the regNO field, using a regular tag in the structural CDX index. Once done, we quit VFP and started it again to ensure that the operating system cleared the allocated memory and the disk caches. When you launch an application in Windows NT/2000, the O.S. allocates memory for it and erases the previous contents of that segment (writing values such as &HF0 or &HFF in each memory cell), so the new VFP instance has no traces of previous sessions in memory; disk caches are flushed as well. If we did not take this precaution, the disk cache and the RDBMS buffers would probably distort the resulting elapsed times, because the whole index would likely still be in memory, since it had just been created.
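    For reference, a minimal sketch of the index used for this test (the tag name is our choice):

    * add a regular tag on regNO to SOLIC's structural CDX
    USE solic EXCLUSIVE
    INDEX ON regNO TAG regNO
    USE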

    The results show a significant improvement:

    Physical sorting | Indexes? | FROM | WHERE | Time (s)
    None | On table A | solic a, profesio b | a.regNO=b.regNO | 41.73
    None | On table A | profesio b, solic a | a.regNO=b.regNO | 41.93
    None | On table A | solic a, profesio b | b.regNO=a.regNO | 40.29
    None | On table A | profesio b, solic a | b.regNO=a.regNO | 39.86

    Let's now sort both tables on the regNO field, using the SORT command. We run the tests again, keeping in mind that the A table has just been physically sorted and no indexes are present:

    Physical sorting | Indexes? | FROM | WHERE | Time (s)
    Table A | None | solic a, profesio b | a.regNO=b.regNO | 50.32
    Table A | None | profesio b, solic a | a.regNO=b.regNO | 50.93
    Table A | None | solic a, profesio b | b.regNO=a.regNO | 50.96
    Table A | None | profesio b, solic a | b.regNO=a.regNO | 56.46

    Please note the significant reduction in elapsed time on sorted tables, with no indexes available: we saved about 22 seconds with nothing more than the physical sorting of the internal table, the biggest one. We can then infer that VFP executes some kind of sort-merge join, since that algorithm is clearly favoured by physical ordering.
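    For reference, a minimal sketch of such a physical sort with the SORT command (the output file name is our choice):

    * SORT writes a physically ordered copy of the table to a new DBF
    USE solic EXCLUSIVE
    SORT TO solicsort ON regNO
    USE solicsort EXCLUSIVE ALIAS solic	&& continue working with the sorted copy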

    Now we index on regNO again, keeping both tables physically sorted, and run the test again:

    Physical sorting | Indexes? | FROM | WHERE | Time (s)
    Table A | On table A | solic a, profesio b | a.regNO=b.regNO | 40.56
    Table A | On table A | profesio b, solic a | a.regNO=b.regNO | 45.64
    Table A | On table A | solic a, profesio b | b.regNO=a.regNO | 46.49
    Table A | On table A | profesio b, solic a | b.regNO=a.regNO | 42.65

    In this case, the difference between indexed-and-unsorted and indexed-and-sorted is not relevant. Rushmore's excellent use of indexes makes the physical ordering almost unnecessary, even with a 600,000-record table such as the one used in our example.

    Let's see the elapsed times we obtained in a chart:

    The deletion mark's role in JOIN operations

    The deletion mark has a noticeable impact only when the number of deleted records is significant compared with the total number of records. In our tests, the SOLIC table had 600,000 records, none of them marked for deletion; thus, elapsed times with and without an index on DELETED() were essentially the same. Let's put the results in a table:

    600K records, no records marked for deletion

    FROM | WHERE | Time WITH index on DELETED() (s) | Time WITHOUT index on DELETED() (s)
    solic a, profesio b | a.regNO=b.regNO | 45.38 | 40.56
    profesio b, solic a | a.regNO=b.regNO | 49.85 | 45.64
    solic a, profesio b | b.regNO=a.regNO | 39.10 | 46.49
    profesio b, solic a | b.regNO=a.regNO | 40.69 | 42.65

    To measure the impact of the deletion mark, we deleted about 30% of the SOLIC table and ran the tests again. Now there are 180,000 records marked for deletion, out of a total of 600,000. The results obtained are the following:

    600K records, 180K records marked for deletion

    FROM | WHERE | Time WITH index on DELETED() (s) | Time WITHOUT index on DELETED() (s)
    solic a, profesio b | a.regNO=b.regNO | 39.02 | 33.44
    profesio b, solic a | a.regNO=b.regNO | 38.05 | 36.96
    solic a, profesio b | b.regNO=a.regNO | 36.16 | 31.63
    profesio b, solic a | b.regNO=a.regNO | 34.89 | 35.31

    To complete the picture, let's delete 90% of the table, leaving only 10% of the records without the deletion mark. The elapsed times are now:

    600K records, 540K records marked for deletion

    FROM | WHERE | Time WITH index on DELETED() (s) | Time WITHOUT index on DELETED() (s)
    solic a, profesio b | a.regNO=b.regNO | 21.96 | 27.17
    profesio b, solic a | a.regNO=b.regNO | 21.51 | 26.71
    solic a, profesio b | b.regNO=a.regNO | 21.32 | 27.08
    profesio b, solic a | b.regNO=a.regNO | 22.19 | 26.52

    We can infer that, for big tables with a normal percentage of deleted records, the presence of an index on the deletion mark has minimal, if any, impact. In addition, the time saving is proportional, though not linear, to the number of records marked for deletion. Only when the percentage of deleted records became quite high, 90 percent, did the presence of the index make a noticeable difference.
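    For reference, a minimal sketch of the deletion-mark index used in these tests (the tag name is our choice):

    * a tag on DELETED() lets Rushmore optimize the implicit deletion-mark filter
    SET DELETED ON
    USE solic EXCLUSIVE
    INDEX ON DELETED() TAG isdel
    USE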

    Conclusions

    The corollary of all this can be summarized as follows:

    • Index-based optimization is the best possible in VFP.
    • If no indexes are available, physically sorting the tables on the key most used in JOINs is the preferred method, especially when large tables are involved.
    • Index-based optimization is so good that the fact that the tables are unsorted goes unnoticed when a suitable index is available.
    • VFP rearranges the predicates used in the FROM and WHERE clauses, and also in the FOR clauses of optimizable xBASE commands.
    • An index on the deletion mark makes no noticeable difference if the table is big and the percentage of deleted records is normal, less than 30%. Above 30% your mileage may vary; the index becomes useful when the number of deleted records is excessive.
    • The speed gain is proportional to the percentage of deleted records (relative to the table with no deletion marks).

    All right, that's all, folks. I hope we can meet again as soon as possible to keep sharing secrets and interesting concepts about our beloved Fox. See you soon.

    Commented bibliography

    • "Visual FoxPro Developer's Guide", by Jeb Long, SAMS Publishing. It's an excellent book that covers the most of VFP themes. Based on VFP 3.0 - we didn't checked for newer versions- perhaps its best part resides on the beginning, with a great introductio to VFP ascendency in the first two or three chapters. Written by a living legend, Jeb Long was the creator of xBASE. He uses to enrich our experience with his several rememorations of implementing dBASE commands at Ashton Tate. His 30-year trajectory in the xBase lane is pretty noticeably at every chapter of the book. Foreword written by David L. Fulton, creator of FoxPro.

    • "Database Management Systems" by Raghu Ramakrishnan and Johannes Gherke; McGrawHill. This book provides general information about the behind-the-scene client-server databases architecture. Some chapters are only of academic interest, but some of them, such as the one regarding SQL language, contain strong concepts to the programmer who wants to enforce his/her theorical base. Even in the fact there is no perceptible biasing for any commercial RDBMS, we suspect there is great references to MS SQL server deep architecture. VFP programmer's can live without this book, but anyone who wants to know more about the background functionality of a client-server database must take a closer look to this book. CD with PDF diapositives of all the chapters, and solved problems.

    • "FoxPro 2.5 Advanced Developer's Handbook, by Pat Adams and Jordan Powell. Until VFP was released, this is the most advanced book we have seen. With great knowledge about programming and optimization tricks, it give us the necessary skills to get the most of FoxPro perfomance on every scenary. The chapters regarding multiuser deployment are outstanding. This is the unique book with examples of signed contracts between programmers and their clients. In spite of the fact that several pages are used to show the same code available on the included CD, this book is highly recommended. Foreword written by David L. Fulton.
    Carlos A. Pérez, Logica I.A.
    Carlos Alejandro Pérez (Resistencia, Chaco, Argentina), 36 years old, married, with two kids. He holds a degree in Electrical and Mechanical Engineering, has been a university professor since 1992, and has been an independent developer since 1991 (FoxBase+). Currently, besides his duties as a university professor, he does private development using Visual FoxPro, 3-tier architectures, Pocket PCs, SOAP and other technologies. He has recently been appointed Director of the Science and Technology Database Project of the Universidad Tecnológica Nacional.