Following the thread on candidate keys, where surrogate vs. natural keys were discussed, I am planning to place, in FAQ, a document explaining some of the relevant points. I invite all members of UT to provide feedback. My main interest is a) whether I missed some important argument for, or against, surrogate keys, b) a sample function to obtain a GUID.
Regards, Hilmar.
Proposed text for FAQ follows; formatting will be reviewed later (I copied HTML code from FrontPage Express; UT adds additional paragraph marks):
On Primary Keys, Foreign Keys, and Candidate Indices
Summary
This document explains the meaning of primary key, foreign key
and candidate index in Visual FoxPro. A discussion of natural and
surrogate keys is included, including the advantages of each, as
well as methods for obtaining unique numbers.
The actual document:
This information was written in reaction to a discussion in
the Universal Thread (www.levelextreme.com).
The information is oriented towards Visual FoxPro, although
most of the concepts can be applied to other programming
languages or database servers as well.
I invite the participants of the Universal Thread to point out
key points missing in my arguments (especially key advantages and
disadvantages of surrogate keys), and to provide a sample
function to get a GUID in Visual FoxPro. The function will be
included with appropriate credits.
Primary and Foreign Keys
In database theory, a primary key is the main
way to identify a table. It is a field, or a combination of
fields, that must be unique for each table. It is usually
convenient for each table - or for most tables, at least - to
have a primary key. In some database management systems this is
an actual requirement.
As an example, each client can be assigned a number (for
instance, in the United States, the Social Security Number (SSN)
can be used).
The same information placed in other tables, as a pointer to
the main table, is called a foreign key. For
instance, in an invoice table, the same SSN can be used as a
pointer to the client table, that is, to indicate for which
client the invoice was extended.
Natural and Surrogate Keys
The primary key can be a field, or fields, with information in
the table that is also used for purposes other than the primary
key. This is called a natural key, or a key
with business value.
Another alternative - preferred by many programmers, including
myself - is to use, as a primary key, a number generated
automatically. The end-user never gets to see this number. This
number would be called a surrogate key, or a key
without business value. Some database systems have built-in
mechanisms to generate these numbers; for instance, in Microsoft
Access you only have to specify the field type as "auto-incremented".
Visual FoxPro has no built-in mechanism for this, but it is
relatively easy to achieve the same result. See below, for
details.
Arguments about this topic can easily be as heated as
arguments about your favorite word processor, or about your
favorite religion. Therefore, although personally I (usually)
favor the surrogate key approach, I will try to be as neutral as
possible, collect information in favor of both approaches, and
let the reader decide what method best suits his needs. (When I
started programming, I used natural keys; eventually, my former
boss, Andrés Frank, convinced me of the advantages of surrogate
keys.)
Arguments in favor of Surrogate Keys
Simplicity: With natural keys, it is often difficult to
find a field, or a combination of fields, that is unique.
For instance, using the SSN in the example above gives
rise to some questions: what about foreign citizens that
have no SSN? What about duplicate SSNs? Yes, these have
been known to exist. With surrogate keys, it is easy to
assign an "autoincremented" field to each table.
This one, automatically generated and never seen by the
end-user, will naturally be unique. With natural keys, it
is sometimes difficult to find a combination of fields
that is always unique - and more than one field has to be
used in some cases. These fields have to be added as
foreign keys in other tables. When using surrogate keys, the end-user can change the
key field he sees - like the SSN, or a product code - at
any moment; there is no need to propagate the change
through other tables, since the primary key doesn't
change. There is usually no need to actually change the
primary key, since the end-user never gets to see it. Size: At first sight, it seems that you need an
additional field. However, this is usually - but not
always - more than offset by savings in the foreign keys
in the other tables. Since the primary key uses only 4
bytes (for integer fields, in Visual FoxPro), the foreign
key in other tables requires only 4 bytes, too. Speed: it is faster for Visual FoxPro to join tables (in
a SQL - SELECT statement, for instance) with integer keys,
than with longer fields - or with a combination of fields.Arguments against Surrogate Keys
Not all of the advantages mentioned above apply in all
cases. The programmer may find it convenient, for
different reasons, to use longer keys instead of integers
(see below); this invalidates some of the advantages
mentioned. A lookup table may use, for a natural key, a
character code with only one or two bytes - less than the
four an integer field uses. And with Cascading Update
Triggers, it is easy (though not always fast) to cascade
changes made in the primary key, to the foreign keys of
other tables. The main problem I see with surrogate keys is the
somewhat complex programming, that requires lookups both
ways, especially for data forms. Let's take the case of
the client as an example. In a table that shows invoices,
the user types the SSN, the form has to do a lookup in
the client table to return the internal client ID, and
store it in the invoices table. On the other hand, for
looking at an existing invoice, the form has to take the
internal client ID and do a lookup, in the client table,
for the SSN. Creating form objects to edit data like this
can be fairly complex (fortunately, through inheritance,
a generic solution is possible). When joining data for reports, sometimes additional
fields have to be retrieved. Example: the SSN. If this is
saved as foreign key in the invoice, it is immediately
available. When using a surrogate key, this field has to
be fetched from the other table. Even though we now have a unique primary key, we may
still find it convenient to find a field, or a
combination of fields, that uniquely identifies each
record to the end user! See the section on Candidate
Indices, below.Options for generating Unique Numbers
The first option that comes to mind, to many people, is
to simply use the record number (recno()) as primary key.
After all, that way we don't need an additional field!
This is not an appropriate solution. The main problem is
that, when we PACK a table, the position of each record
changes. We can use an integer field, automatically incremented
for each record. This approach is fairly straightforward.
Create a table with fields sequence C(20), nextnum I.
Copy the following function (SerialNumber) into your
database stored procedures. Call the function from the
default value of the primary key field, for instance,
default value of clientid (in client table) =
SerialNumber("client"). This allows for up to 2e9
different values. Some numbers may be skipped (if the
user creates a record, the number is automatically
assigned; if the user then clicks on "Undo",
the changes are reverted, but the number is already used).
This is not a problem with numbers which the user never
sees - our main worry is that the primary key be unique.FUNCTION SERIALNUMBER(tnSequence)
tnSequence = lower(tnSequence)
local lnSelect
lnSelect = select()
if used("serialnumber")
select serialnumber
else
select 0
use serialnumber
endif
set order to "sequence"
seek padr(tnSequence, len(sequence))
if not found()
append blank
replace sequence with tnSequence, nextnum with 1
endif
local lnReturnValue
if lock()
lnReturnValue = nextnum
replace nextnum with nextnum + 1
else
lnReturnValue = -1
endif
unlock
select (lnSelect)
return lnReturnValue
ENDFUNC
This function can also be used to generate sequential numbers
which the user does see, like invoice numbers. In this case, to
avoid skipping numbers, the function should be called only when
saving the record, and the record should be saved within a
transaction. Thus, if the save fails, the changes can be undone.
In this case, table SerialNumber must be part of the database (that
is, not a free table), otherwise it can't participate in
transactions.
Some programmers prefer to use GUIDs for primary keys. As
far as I could gather, GUID stands for "global
unique identifier". It is a unique value generated
by the Windows API. Actually, it is more correct to say
it is assumed to be unique: because of the fact that it
is created randomly (as far as I know), and because of
its size (128 bits, or 16 bytes), it is highly unlikely
that the same value is ever used again - either on the
same machine, or on another one (there are 3.4e38
different 128-bit values). GUIDs are fairly large,
therefore some of the advantages of using surrogate keys
(size, speed) don't apply when using them. The reason to use GUIDs is because when several users
have to input data, and can't connect to the central
network all the time (because of geographical distance),
a method must be used to ensure that primary keys are
unique. This can also be done by assigning a unique
number to each user; let's say this unique user ID is
stored in variable gnUserId. Then, either the primary key
can be made up of two fields (in the client example:
field client, and field UserId), or the information can
be combined in a single field. This latter choice makes
it easier to join tables, for referential integrity, or
for a query or view. The following expression might be
placed in the default value of field ClientId: bintoc(gnUserId,
2) + bintoc(SerialNumber("Client")). Note that
for this example, the primary key - and the corresponding
foreign keys - should be defined as C(6). This allows for
up to 65,000 different users (or offices). Function
bintoc() requires VFP 5 or later.Candidate indices and unique values
Theory: a primary key is the main form of identifying a record.
Naturally, the field or fields that make up a primary key must be
unique. However, we may also require other fields, or
combinations of fields, to be unique. It is for this purpose that
Visual FoxPro lets us create candidate indices.
Uniqueness (in primary indices and candidate indices) is enforced
through the index. In other languages, candidate indices would be
called "unique indices". In Visual FoxPro, the keyword
"unique" was used previously (in FoxPro) for another
purpose. Unique indices are kept for backwards compatibility, but
apart from that, they have no practical application in Visual
FoxPro (to obtain all the different values for a field, use a
SELECT - SQL statement instead, either with the DISTINCT keyword,
or with GROUP BY).
In each table, there can be only one primary index, but there
can be several candidate indices. Other than what I said already,
Visual FoxPro doesn't make much difference between a primary and
a candidate index.
A technical problem arises, for instance, when using candidate
indices for a SSN. After deleting a record, it is still in the
table and participates in uniqueness validation. In other words,
when erasing a record and then re-using the key value, Visual
FoxPro will protest about the duplicate value. One solution for
this is to use a filter on an index. Remember that a filtered
index can't participate in Rushmore Optimization. This may be
good, or bad, depending on the circumstances: see my document on
Rushmore Optimization for details. If Rushmore Optimization is
desired for a specific field, sometimes it may be necessary to
define two indices: a filtered one, for the primary or candidate
index, to ensure uniqueness; and a non-filtered one, of "regular"
type, for Rushmore Optimization.
Copyright (C) 2001, Hilmar Zonneveld. This text may be copied
freely.
Difference in opinions hath cost many millions of lives: for instance, whether flesh be bread, or bread be flesh; whether whistling be a vice or a virtue; whether it be better to kiss a post, or throw it into the fire... (from Gulliver's Travels)