by Jo K. Horan
Last month, in the first of this series of database design articles, we defined some common database terms and explained what they meant. This time well talk about modeling your data.
This includes deciding what belongs in the same table and what should go into different tables, how to link tables together to provide the most useful and accurate information with the least amount of work, and how to be prepared for future additions and changes to the data.
Well need a simple example to illustrate this process. Pretend that youre the membership chairman for a local volunteer organization. Your primary duty is to maintain a mailing list of all the groups members, along with phone numbers, the date dues were paid, and so on. You also need to know who the groups current officers are, so that if someone calls you with a question, you can refer them to the proper person. Youll get each individuals original information from a membership application, but youll depend on phone calls for changes. And once a year youll have to update the list of officers. It sounds easy at first, but you know from experience that things can get complicated pretty quickly, and you want to make sure that you do it right.
There are two different approaches that you can take. Usually people start out with a typical two-dimensional spreadsheet layout, which they expand as they go along until they end up with something so disorganized that they cant really manage it any more. Then they try to figure out what went wrong, what should be changed, and how to rearrange the data to meet their current needs. This happens so often that we have a name for it. Its called the analytic approach, and there are certain clues to look for and basic steps to take to resolve the problems. (Well discuss this at greater length later on.) The other technique is called the synthetic approach, and its a better way to go. You start from scratch, figuring out what data you have and what questions you will need to ask, and you design your database right the first time. The structure you end up with may appear a bit more complex than necessary, but if its done right it can accept the inevitable additions and changes without needing any major revisions. This is where well start.
The first step is to name the tables. In our example, we have two different entities. The first is the name-and-address table, which well call MEMBERS. Each row contains data about one individual, and each row has to be different from every other row. Since you cant always depend on unique names, you should include a code number of some sort as a key field to identify each individual. If your group has a membership number, thats what you should use. If not, you could try social security numbers, or you could create your own sequential field. I like to use a short character field that begins with a letter (like M for member) and has enough space for the number of entries I expect to need. For example, numbers like M001 through M100 would identify members one through one hundred, with plenty of room for later additions. Codes like this have other advantages over names; theyre shorter, and theyre not as likely to be incorrectly entered. You tend to be more careful when youre typing codes instead of words.
The second table this example requires is a list of OFFICERS. This table needs a key field called Position that contains the titles of the various officers, such as President, Secretary, Treasurer, and so on. The table also needs to identify the person currently holding the office, which you can do in one of two ways. For convenience, you could retype the officers name just as it is in the MEMBERS table. But a better way would be to include the officers member number instead, to provide a link to the original table where all the data about that person already exists. This defines a one-to-one relationship between the two tables, because each entry in the second table refers to only one entry in the first table. This approach has several advantages. Youre not retyping lengthy names, youre not likely to make mistakes, and all the information you might need later on (phone numbers, et cetera) will be readily available when the time comes. You do need to make sure that the linking field is the same size and field type in both tables, and you should try to use the same field name as well, even though different names may be acceptable.
Okay, I can hear you asking the next question. Why dont I just include a column in the original table that identifies each officer? There are several reasons not to do this. The first is the database design principle that says that being an officer is not the same thing as being a member, and therefore these are separate entities that belong in separate tables. The next consideration is a practical one. Since you have many members, but only a few officers, youre wasting a lot of space in your original table for blank fields that dont contain an officers title except in a very few cases. And the officers are harder to spot in the big table, since theyre probably not close together in the name-and-address (or zip code) order that you normally use for your mailing list. Another disadvantage is that youre doubling the amount of work required when officers change. Youll have to find each outgoing officer, delete the title in that record, find the incoming officers entry, and retype the title there. But if the officers are listed in a separate table, as they should be, all you have to do is replace each outgoing officers member number with the corresponding new one, and youre through. Much better.
Once youve defined the necessary information in each table, you should think about other bits of data that might be useful. For example, the OFFICERS table could include dates for the current term, or the names that the present officers like to be known by. Since these names are useful only for referring phone calls to the right people, they definitely dont belong on the mailing list, but on the list of officers instead. Data items like these are considered to be attributes of the entity to which theyre assigned, and its important to make sure that theyre added to the right table for the right reason. Its not only what the data consists of, but also its intended purpose, thats important.
Thats enough of an introduction to the synthetic approach to database design. Lets briefly mention the analytic or normalization approach, which looks for previous errors and tries to correct them. For example, suppose youve been asked to maintain a list of committees for your organization, including the name of the committee, the name and phone number of the current committee chairperson, the date of the committees next meeting, and the names of all the members of that committee. Youve started out with a spreadsheet where each row contains all this information for each different committee. By including the names of all the committee members on the same row, youve already violated Rule 1: No Repeating Groups. Since the number of members can vary for each committee, youre wasting space by allowing enough room on each row to include the maximum number of members, when most of the time you dont need all those columns. The list of committee members belongs in a separate table, with a one-to-many relationship that links those members to their different committees.
Youve also violated Rule 2: No Redundant Data. Each row identifies the current chairperson for that committee, but that persons phone number depends only on who they happen to be, not on which committee theyre currently chairing. So including that phone number in this table makes the phone number redundant data. You need a one-to-one link based on member number, like the one you created for the officers, to point to the chairpersons entry in the main membership table. Thats where you can find the phone number. And there are other rules to be aware of, but they apply to more complicated examples than those weve used here. Meanwhile, you can see that just recognizing the potential pitfalls can help you get off to a much better start the next time you need to set up a database.
Now lets review what weve talked about here. Weve described two approaches for defining a database, whether youre starting from scratch or trying to fix previously existing problems. Weve discussed entities and key fields and relationships, which can be one-to-many or one-to-one. You can also have many-to-many relationships, which well save for next months article, along with some diagrams to illustrate how these relationships work. In the meantime, plan to come to HAL-PC headquarters for the Alpha Four / Alpha Five SIG meeting at 6:30 p.m. on Monday, August 10th, and well show you some on-line examples of these various types of database configurations, and try to answer any questions you may have about these topics. Anything you learn from us will apply to whatever database software you prefer to use, since these database design principles are not unique to any one product. Well look forward to seeing you on August 10th!
Jo K. Horan is a database developer, consultant, and trainer, specializing in Alpha Four and Alpha Five. She is leader of the Alpha Four / Alpha Five Special Interest Group within HAL-PC, the Houston Area League of PC Users. She may be reached at 713-723-6709 or by mail at DataBasics, 9806 Cliffwood, Houston, TX 77096.
E-mail me at webmaster@hal-pc.org with any comments you have and tell me what you want to see here.