Primary Keys Document - Level Extreme

Level Extreme platform

Subscription

Corporate profile

Products & Services

Support

Legal

Français

Primary Keys Document

Message

From

04/08/2001 21:58:23

Jim Nelson
Toronto, Ontario, Canada

04/08/2001 20:10:36

Hilmar Zonneveld
Independent Consultant
Cochabamba, Bolivia

General information

Forum:

Visual FoxPro

Category:

Databases,Tables, Views, Indexing and SQL syntax

Title:

Re: Primary Keys Document

Miscellaneous

Thread ID:

00539860

Message ID:

00539869

Views:

Only a few comments, below the subject of the comment in each case. . .

Opps, sorry. Now that I've been through the whole document I ended up having more than a few comments.

You state that the document will explain the benefits/disadvantages of natural vs surrogate, but you never do much regarding natural. In fact it is clear that, as written, this is a document designed to convince people to use surrogates and only surrogates. You should title it appropriately if you intend to leave it substantially as-is.

>Following the thread on candidate keys, where surrogate vs. natural keys were discussed, I am planning to place, in FAQ, a document explaining some of the relevant points. I invite all members of UT to provide feedback. My main interest is a) whether I missed some important argument for, or against, surrogate keys, b) a sample function to obtain a GUID.
>
>Regards, Hilmar.
>
>Proposed text for FAQ follows; formatting will be reviewed later (I copied HTML code from FrontPage Express; UT adds additional paragraph marks):
>
>
>

On Primary Keys, Foreign Keys, and Candidate Indices

>
>

Summary

>
>

This document explains the meaning of primary key, foreign key
>and candidate index in Visual FoxPro. A discussion of natural and
>surrogate keys is included, including the advantages of each, as
>well as methods for obtaining unique numbers.

>
>

The actual document:

>
>

This information was written in reaction to a discussion in
>the Universal Thread (www.levelextreme.com).

>
>

The information is oriented towards Visual FoxPro, although
>most of the concepts can be applied to other programming
>languages or database servers as well.

>
>

I invite the participants of the Universal Thread to point out
>key points missing in my arguments (especially key advantages and
>disadvantages of surrogate keys), and to provide a sample
>function to get a GUID in Visual FoxPro. The function will be
>included with appropriate credits.

>
>

Primary and Foreign Keys

>
>

In database theory, a primary key is the main
>way to identify a table. It is a field, or a combination of
>fields, that must be unique for each table. It is usually
>convenient for each table - or for most tables, at least - to
>have a primary key. In some database management systems this is
>an actual requirement.

>
Technically it doesn't "identify" a table.
It is not that it must be unique for each table, but for each record of each table.
I believe that a "Primary Key" is, by definition, unique regardless of the DBMS. It is true that most database systems support tables without them but that it is generally recommended that at least the important tables in a database have them.

>

As an example, each client can be assigned a number (for
>instance, in the United States, the Social Security Number (SSN)
>can be used).

>
>

The same information placed in other tables, as a pointer to
>the main table, is called a foreign key. For
>instance, in an invoice table, the same SSN can be used as a
>pointer to the client table, that is, to indicate for which
>client the invoice was extended.

>
Just a suggestion that "link" might be more appropriate than "pointer".

>

Natural and Surrogate Keys

>
>

The primary key can be a field, or fields, with information in
>the table that is also used for purposes other than the primary
>key. This is called a natural key, or a key
>with business value.

>
>

Another alternative - preferred by many programmers, including
>myself - is to use, as a primary key, a number generated
>automatically. The end-user never gets to see this number. This
>number would be called a surrogate key, or a key
>without business value. Some database systems have built-in
>mechanisms to generate these numbers; for instance, in Microsoft
>Access you only have to specify the field type as "auto-incremented".
>Visual FoxPro has no built-in mechanism for this, but it is
>relatively easy to achieve the same result. See below, for
>details.

>
You introduce a bias here in two ways (by the way, that may well be your intention, but if that is the case then you should state at the start of the document that its purpose is to convince the reader to adopt surrogate keys.
The first bias - you state outright that it is not only your preference but the preference of many programmers. You might well be able to say that natural keys are the preference of many programmers too.
Second bias - you go into far more detail here than you do for natural keys, including mentioning that some dtabase systems have built-in facilities to generate the surrogates. At the very least you could mention that, of course, there is no need to use that feature for natural keys.
Technicality: I doubt that I could simply specify auto-increment for CustomerName in Access.

>

Arguments about this topic can easily be as heated as
>arguments about your favorite word processor, or about your
>favorite religion. Therefore, although personally I (usually)
>favor the surrogate key approach, I will try to be as neutral as
>possible, collect information in favor of both approaches, and
>let the reader decide what method best suits his needs. (When I
>started programming, I used natural keys; eventually, my former
>boss, Andrés Frank, convinced me of the advantages of surrogate
>keys.)

>
You have already been about as biased as you can be and now you state that you are going to be neutral?!?!?! Again, unless your purpose is to convince people to use surrogates, all of this personal stuff should go.

>

Arguments in favor of Surrogate Keys

>
>

Simplicity: With natural keys, it is often difficult to
> find a field, or a combination of fields, that is unique.
> For instance, using the SSN in the example above gives
> rise to some questions: what about foreign citizens that
> have no SSN? What about duplicate SSNs? Yes, these have
> been known to exist. With surrogate keys, it is easy to
> assign an "autoincremented" field to each table.
> This one, automatically generated and never seen by the
> end-user, will naturally be unique. With natural keys, it
> is sometimes difficult to find a combination of fields
> that is always unique - and more than one field has to be
> used in some cases. These fields have to be added as

You should say "Simplicity of xxxxxxxxx:" and not just simplicity (because I expect that further down one of the negatives for surrogates will be additional complexity introduced for many things). It will be incongruous to have "simplicity" here and "added complexity" later on.

> foreign keys in other tables.

When using surrogate keys, the end-user can change the
> key field he sees - like the SSN, or a product code - at
> any moment; there is no need to propagate the change
> through other tables, since the primary key doesn't
> change. There is usually no need to actually change the
> primary key, since the end-user never gets to see it.

Size: At first sight, it seems that you need an
> additional field. However, this is usually - but not
> always - more than offset by savings in the foreign keys
> in the other tables. Since the primary key uses only 4
> bytes (for integer fields, in Visual FoxPro), the foreign
> key in other tables requires only 4 bytes, too.

Speed: it is faster for Visual FoxPro to join tables (in
> a SQL - SELECT statement, for instance) with integer keys,
> than with longer fields - or with a combination of fields.

>
I'm not sure I buy the argument about size, and you had better check the argument about speed.

>

Arguments against Surrogate Keys

>
>

Not all of the advantages mentioned above apply in all
> cases. The programmer may find it convenient, for
> different reasons, to use longer keys instead of integers
> (see below); this invalidates some of the advantages
> mentioned. A lookup table may use, for a natural key, a
> character code with only one or two bytes - less than the
> four an integer field uses. And with Cascading Update
> Triggers, it is easy (though not always fast) to cascade
> changes made in the primary key, to the foreign keys of
> other tables.

The main problem I see with surrogate keys is the
> somewhat complex programming, that requires lookups both
> ways, especially for data forms. Let's take the case of
> the client as an example. In a table that shows invoices,
> the user types the SSN, the form has to do a lookup in
> the client table to return the internal client ID, and
> store it in the invoices table. On the other hand, for
> looking at an existing invoice, the form has to take the
> internal client ID and do a lookup, in the client table,
> for the SSN. Creating form objects to edit data like this
> can be fairly complex (fortunately, through inheritance,
> a generic solution is possible).

When joining data for reports, sometimes additional
> fields have to be retrieved. Example: the SSN. If this is
> saved as foreign key in the invoice, it is immediately
> available. When using a surrogate key, this field has to
> be fetched from the other table.

Even though we now have a unique primary key, we may
> still find it convenient to find a field, or a
> combination of fields, that uniquely identifies each
> record to the end user! See the section on Candidate
> Indices, below.

>
First, you go (as predicted) and shoot down most of your "advantages". I think that the reader will be a little pissed off, not to mention confused.
Secondly, I can think of three other disadvantages:
1) The database designer might easily be unwittingly swayed to come up with a design that does not reflect the business flow/processes and most designers feel that it is best practise to have a DB design mirror the business whenever possible.
2) If a system calls for user to be able to generate reports/lists based on self-selectable criteria, surrogates make this almost impossible.
3) It is almost certain that the natural primary will otherwise have to be held as a CANDIDATE key anyway, introducing additional overhead and additional complexity because the need to confirm uniqueness remains, with the same problems that are allegedly avioded by using surrogates.

>

Options for generating Unique Numbers

>
>

The first option that comes to mind, to many people, is
> to simply use the record number (recno()) as primary key.
> After all, that way we don't need an additional field!
> This is not an appropriate solution. The main problem is
> that, when we PACK a table, the position of each record
> changes.

We can use an integer field, automatically incremented
> for each record. This approach is fairly straightforward.
> Create a table with fields sequence C(20), nextnum I.
> Copy the following function (SerialNumber) into your
> database stored procedures. Call the function from the
> default value of the primary key field, for instance,
> default value of clientid (in client table) =
> SerialNumber("client"). This allows for up to 2e9
> different values. Some numbers may be skipped (if the
> user creates a record, the number is automatically
> assigned; if the user then clicks on "Undo",
> the changes are reverted, but the number is already used).
> This is not a problem with numbers which the user never
> sees - our main worry is that the primary key be unique.

>
>

FUNCTION SERIALNUMBER(tnSequence)
>	* Get serial number. Used mainly to generate primary keys.
>	* The easiest way to achieve this is to call this function from a field's default value.
>	tnSequence = lower(tnSequence)
>	local lnSelect
>	lnSelect = select()
>	if used("serialnumber")
>		select serialnumber
>	else
>		select 0
>		use serialnumber
>	endif
>	set order to "sequence"
>	seek padr(tnSequence, len(sequence))
>	if not found()
>		append blank
>		replace sequence with tnSequence, nextnum with 1
>	endif
>	local lnReturnValue
>	if lock()
>		lnReturnValue = nextnum
>		replace nextnum with nextnum + 1
>	else
>		lnReturnValue = -1
>	endif
>	unlock
>	select (lnSelect)
>	return lnReturnValue
>ENDFUNC
>

>
>

This function can also be used to generate sequential numbers
>which the user does see, like invoice numbers. In this case, to
>avoid skipping numbers, the function should be called only when
>saving the record, and the record should be saved within a
>transaction. Thus, if the save fails, the changes can be undone.
>In this case, table SerialNumber must be part of the database (that
>is, not a free table), otherwise it can't participate in
>transactions.

>
>

Some programmers prefer to use GUIDs for primary keys. As
> far as I could gather, GUID stands for "global
> unique identifier". It is a unique value generated
> by the Windows API. Actually, it is more correct to say
> it is assumed to be unique: because of the fact that it
> is created randomly (as far as I know), and because of
> its size (128 bits, or 16 bytes), it is highly unlikely
> that the same value is ever used again - either on the
> same machine, or on another one (there are 3.4e38
> different 128-bit values). GUIDs are fairly large,
> therefore some of the advantages of using surrogate keys
> (size, speed) don't apply when using them.

The reason to use GUIDs is because when several users
> have to input data, and can't connect to the central
> network all the time (because of geographical distance),
> a method must be used to ensure that primary keys are
> unique. This can also be done by assigning a unique
> number to each user; let's say this unique user ID is
> stored in variable gnUserId. Then, either the primary key
> can be made up of two fields (in the client example:
> field client, and field UserId), or the information can
> be combined in a single field. This latter choice makes
> it easier to join tables, for referential integrity, or
> for a query or view. The following expression might be
> placed in the default value of field ClientId: bintoc(gnUserId,
> 2) + bintoc(SerialNumber("Client")). Note that
> for this example, the primary key - and the corresponding
> foreign keys - should be defined as C(6). This allows for
> up to 65,000 different users (or offices). Function
> bintoc() requires VFP 5 or later.

>
>

Candidate indices and unique values

>
>

Theory: a primary key is the main form of identifying a record.
>Naturally, the field or fields that make up a primary key must be
>unique. However, we may also require other fields, or
>combinations of fields, to be unique. It is for this purpose that
>Visual FoxPro lets us create candidate indices.
>Uniqueness (in primary indices and candidate indices) is enforced
>through the index. In other languages, candidate indices would be
>called "unique indices". In Visual FoxPro, the keyword
>"unique" was used previously (in FoxPro) for another
>purpose. Unique indices are kept for backwards compatibility, but
>apart from that, they have no practical application in Visual
>FoxPro (to obtain all the different values for a field, use a
>SELECT - SQL statement instead, either with the DISTINCT keyword,
>or with GROUP BY).

>
>

In each table, there can be only one primary index, but there
>can be several candidate indices. Other than what I said already,
>Visual FoxPro doesn't make much difference between a primary and
>a candidate index.

>
>

A technical problem arises, for instance, when using candidate
>indices for a SSN. After deleting a record, it is still in the
>table and participates in uniqueness validation. In other words,
>when erasing a record and then re-using the key value, Visual
>FoxPro will protest about the duplicate value. One solution for
>this is to use a filter on an index. Remember that a filtered
>index can't participate in Rushmore Optimization. This may be
>good, or bad, depending on the circumstances: see my document on
>Rushmore Optimization for details. If Rushmore Optimization is
>desired for a specific field, sometimes it may be necessary to
>define two indices: a filtered one, for the primary or candidate
>index, to ensure uniqueness; and a non-filtered one, of "regular"
>type, for Rushmore Optimization.

>
You neglect to mention that only VFP suffers from the DELETED problem. The other popular DBMS' all do physical deletes (or at least make it look that way.

>

>
>

I truly hope that this is helpful to you.

JimN

Map

View

Click here to load this message in the networking platform