Database Design

by Jo K. Horan

Part 2 - Building a Model

Last month, in the first of this series of database design articles, we defined some common database terms and explained what they meant. This time we’ll talk about “modeling” your data.

This includes deciding what belongs in the same table and what should go into different tables, how to link tables together to provide the most useful and accurate information with the least amount of work, and how to be prepared for future additions and changes to the data.

We’ll need a simple example to illustrate this process. Pretend that you’re the membership chairman for a local volunteer organization. Your primary duty is to maintain a mailing list of all the group’s members, along with phone numbers, the date dues were paid, and so on. You also need to know who the group’s current officers are, so that if someone calls you with a question, you can refer them to the proper person. You’ll get each individual’s original information from a membership application, but you’ll depend on phone calls for changes. And once a year you’ll have to update the list of officers. It sounds easy at first, but you know from experience that things can get complicated pretty quickly, and you want to make sure that you do it right.

There are two different approaches that you can take. Usually people start out with a typical two-dimensional spreadsheet layout, which they expand as they go along until they end up with something so disorganized that they can’t really manage it any more. Then they try to figure out what went wrong, what should be changed, and how to rearrange the data to meet their current needs. This happens so often that we have a name for it. It’s called the “analytic” approach, and there are certain clues to look for and basic steps to take to resolve the problems. (We’ll discuss this at greater length later on.) The other technique is called the “synthetic” approach, and it’s a better way to go. You start from scratch, figuring out what data you have and what questions you will need to ask, and you design your database right the first time. The structure you end up with may appear a bit more complex than necessary, but if it’s done right it can accept the inevitable additions and changes without needing any major revisions. This is where we’ll start.

The first step is to name the tables. In our example, we have two different “entities”. The first is the name-and-address table, which we’ll call “MEMBERS”. Each row contains data about one individual, and each row has to be different from every other row. Since you can’t always depend on unique names, you should include a code number of some sort as a “key” field to identify each individual. If your group has a membership number, that’s what you should use. If not, you could try social security numbers, or you could create your own sequential field. I like to use a short character field that begins with a letter (like “M” for member) and has enough space for the number of entries I expect to need. For example, numbers like “M001” through “M100” would identify members one through one hundred, with plenty of room for later additions. Codes like this have other advantages over names; they’re shorter, and they’re not as likely to be incorrectly entered. You tend to be more careful when you’re typing codes instead of words.

The second table this example requires is a list of “OFFICERS”. This table needs a key field called “Position” that contains the titles of the various officers, such as “President”, “Secretary”, “Treasurer”, and so on. The table also needs to identify the person currently holding the office, which you can do in one of two ways. For convenience, you could retype the officer’s name just as it is in the “MEMBERS” table. But a better way would be to include the officer’s member number instead, to provide a link to the original table where all the data about that person already exists. This defines a “one-to-one relationship” between the two tables, because each entry in the second table refers to only one entry in the first table. This approach has several advantages. You’re not retyping lengthy names, you’re not likely to make mistakes, and all the information you might need later on (phone numbers, et cetera) will be readily available when the time comes. You do need to make sure that the “linking” field is the same size and field type in both tables, and you should try to use the same field name as well, even though different names may be acceptable.

Okay, I can hear you asking the next question. “Why don’t I just include a column in the original table that identifies each officer?” There are several reasons not to do this. The first is the database design principle that says that being an officer is not the same thing as being a member, and therefore these are separate “entities” that belong in separate tables. The next consideration is a practical one. Since you have many members, but only a few officers, you’re wasting a lot of space in your original table for blank fields that don’t contain an officer’s title except in a very few cases. And the officers are harder to spot in the big table, since they’re probably not close together in the name-and-address (or zip code) order that you normally use for your mailing list. Another disadvantage is that you’re doubling the amount of work required when officers change. You’ll have to find each outgoing officer, delete the title in that record, find the incoming officer’s entry, and retype the title there. But if the officers are listed in a separate table, as they should be, all you have to do is replace each outgoing officer’s member number with the corresponding new one, and you’re through. Much better.

Once you’ve defined the necessary information in each table, you should think about other bits of data that might be useful. For example, the “OFFICERS” table could include dates for the current term, or the names that the present officers like to be known by. Since these names are useful only for referring phone calls to the right people, they definitely don’t belong on the mailing list, but on the list of officers instead. Data items like these are considered to be “attributes” of the “entity” to which they’re assigned, and it’s important to make sure that they’re added to the right table for the right reason. It’s not only what the data consists of, but also its intended purpose, that’s important.

That’s enough of an introduction to the “synthetic” approach to database design. Let’s briefly mention the “analytic” or “normalization” approach, which looks for previous errors and tries to correct them. For example, suppose you’ve been asked to maintain a list of committees for your organization, including the name of the committee, the name and phone number of the current committee chairperson, the date of the committee’s next meeting, and the names of all the members of that committee. You’ve started out with a spreadsheet where each row contains all this information for each different committee. By including the names of all the committee members on the same row, you’ve already violated Rule 1: No Repeating Groups. Since the number of members can vary for each committee, you’re wasting space by allowing enough room on each row to include the maximum number of members, when most of the time you don’t need all those columns. The list of committee members belongs in a separate table, with a “one-to-many” relationship that links those members to their different committees.

You’ve also violated Rule 2: No Redundant Data. Each row identifies the current chairperson for that committee, but that person’s phone number depends only on who they happen to be, not on which committee they’re currently chairing. So including that phone number in this table makes the phone number redundant data. You need a “one-to-one” link based on member number, like the one you created for the officers, to point to the chairperson’s entry in the main membership table. That’s where you can find the phone number. And there are other rules to be aware of, but they apply to more complicated examples than those we’ve used here. Meanwhile, you can see that just recognizing the potential pitfalls can help you get off to a much better start the next time you need to set up a database.

Now let’s review what we’ve talked about here. We’ve described two approaches for defining a database, whether you’re starting from scratch or trying to fix previously existing problems. We’ve discussed “entities” and “key fields” and “relationships”, which can be “one-to-many” or “one-to-one”. You can also have “many-to-many” relationships, which we’ll save for next month’s article, along with some diagrams to illustrate how these relationships work. In the meantime, plan to come to HAL-PC headquarters for the Alpha Four / Alpha Five SIG meeting at 6:30 p.m. on Monday, August 10th, and we’ll show you some on-line examples of these various types of database configurations, and try to answer any questions you may have about these topics. Anything you learn from us will apply to whatever database software you prefer to use, since these database design principles are not unique to any one product. We’ll look forward to seeing you on August 10th!

Jo K. Horan is a database developer, consultant, and trainer, specializing in Alpha Four and Alpha Five. She is leader of the Alpha Four / Alpha Five Special Interest Group within HAL-PC, the Houston Area League of PC Users. She may be reached at 713-723-6709 or by mail at DataBasics, 9806 Cliffwood, Houston, TX 77096.

E-mail me at webmaster@hal-pc.org with any comments you have and tell me what you want to see here.

Back to the Magazine Home Page