Overview
The purpose of this article is to give an overview of normalization. Basically, normalization refers to having an efficient table structure. I will not discuss the famous "first to fifth normal forms" - if you want that information, enough texts exist about it in other places (search sites like Google, or PcWebopedia, for terms like "normalization", or "third normal form"). Instead, I will outline a few simple rules to help you plan your table structures efficiently.
The information in this article is relevant to FoxPro, Visual FoxPro, or any other database management system. For instance, the same considerations apply when you decide to use Visual FoxPro as a front-end for a client/server system, with Oracle, SQL Server, etc.
Introduction
The basic idea of normalization is to distribute data among tables efficiently. An efficient table structure helps you save space, but there are also other, usually more important, criteria. For instance, in a sales system, if you save all the data related to the client in the main invoice table, you have to repeat this information for each invoice, and this wastes space. However, there are more serious problems, too. For instance, if some information, say, the client address, changes, you have to make the change for each invoice, which can be both complicated to program, and slow. Or, you do the update for one invoice, and forget to update another one - now, which is the valid data? ("update anomaly"). Moreover, if you delete the last invoice for a client, you lose the information about the client ("delete anomaly"). Clearly, it is convenient to have a separate table for clients.
In general, it is convenient to have a separate table for each separate "entity" you can identify - in a sales system, this might be clients, products and invoices. But this general definition does not suffice to solve all your design problems - for instance, in practice, it results convenient to have the invoice header and the invoice details (products sold) in separate tables.
Typical problems
Let's start by showing some typical problems that can occur when information is not stored efficiently. A university decides to record student grades. It decides that no student will register in more than 6 subjects in a semester, and therefore creates a "grades" table, where the following information is stored: student code, semester, subject, grade. The last two fields are repeated 5 more times.
Now, just try to imagine the nightmare of obtaining a list of all students registered in a certain subject. Also, the limit of 6 subjects per student is arbitrary - the rules can easily change next semester, or exceptions can be granted, to allow for a 7th. or 8th. subject for a student. Finally, reserving fields for a large amount of subjects wastes disk space for the cases of students who register in only a few subjects.
Let's hope that the programmers are clever enough not to start with any serious programming before they come up with a more flexible data structure.
In this case, the solution is to put every subject for every student, with its corresponding grade, in a separate record. Now, commands like LIST FOR SUBJECT = varSubject are possible (or BROWSE, or SCAN, with a similar condition) are possible. Clearly, such simple constructs are impossible with the previous structure.
Also, now an arbitrary number of grades can be stored for each student, and no space gets wasted for the non-used subjects.
Primary and Foreign Keys
Usually, each table needs a primary key (a field, or fields, that uniquely identify each record). This key is used in other tables as a reference. For instance, the client table might have a field that has a unique value for each client. This value is used in an invoice, to reference the corresponding client.
This primary key might be a client code assigned by the company, or, for instance, a sequential number which the end-user never sees. (See my article on primary keys, in the Universal Thread Magazine, for more details about primary and foreign keys.)
Rules based on relations
The simple rules I suggest you consider are based on the kind of relation between to pieces of data: whether the relation is one-to-one, one-to-many, or many-to-many.
One-to-one: A one-to-one relation (a piece of data is used, at most, once) is an indication that both pieces of data should simply be different fields in the same table. Please note that this includes the case that you need to store some simple data (like a date, age, name, etc.), which doesn't have to be unique, but for which you don't need to store additional information.
One-to-many: A one-to-many relation is an indication that the information has to go into two separate tables. The many-side is a separate table, that has a pointer (a foreign key) - usually a single field - that points to the one-side.
Many-to-many: A many-to-many relation requires a third table, to resolve into two one-to-many relations.
Note that for a database to be considered "normalized", information that can repeat (like the different subjects and grades for each student, mentioned above) is always stored in separate records, not in separate fields. This may require additional tables, which can be determined with the above rules.
Examples
Let's now see a few examples, from a sales system:
Self-relations
When a table is related to itself, the number of tables described in the rules above decreases by one. Examples:
Case study: registration system
To see how this works in practice, let's do a more detailed analysis of the student registration and grades system, for a university, mentioned previously.
At the beginning of each semester, a student registers for several subjects. At the end of the semester, the student receives a grade for the subject.
In the above paragraph, we can quickly identify several "entities", for each of which we need to save additional information, and for each of which we therefore need a separate table. While we think about what additional information we need to store, the need for some additional tables soon becomes clear.
In actual practice additional information is required (for instance: payments; course requirements by career, to compare whether a student has fulfilled the requirements for the degree; and other information), I will leave it at these brief examples.
Denormalization
When following the rules outlined above, your tables will usually follow most formal definitions of "normalization". However, I suggest to use the rules above as general guidelines. Full normalization is not always the most efficient way to do things. For instance, you might consider the creation of a separate address table, as outlined above, as too much trouble, and prefer to leave two fields for emails (email1, email2). While this is not completely normalized, you may consider that you don't need a third email anyway, that you don't need to access all emails systematically (for automatic processing, you will only use the default email, email1, anyway), and that maintaining an additional table requires additional overhead.
Summary
Whether you use a more formal definition, which you can find in other texts, or the rules outlined above, proper normalization of your database structures can often help you save space, avoid certain anomalies, and add flexibility to your database system.