Tuesday, June 7, 2011

Normalizing Repeated Columns: Yes/No Fields (Part1)

In my earlier series (The Problem of Repeated Columns), I defined repeated columns and talked about the data integrity problems associated with them. I also showed several different examples of repeated columns and the how difficult it is to query repeated columns as compared to the normalized equivalent. If you haven't read these yet, it would be worthwhile to read first. Similarly, if you are not familiar with the concept of Normalization, you should read my blog series What is Normalization?

One of the comments suggested I should discuss how to convert a table with repeated columns into a normalized table structure. I thought that was a really good idea, so I'm going to spend the next few posts doing so. The difficulty in discussing this issue, however, is that the exact solution differs with each table, so there is no single solution. Fortunately, in the last series, I showed 5 different tables, each with a slightly different structure, so while I cannot show a single solution that will work for every table with repeated columns, hopefully, one of the following will work for most cases.

Others in this series:
Patient Symptoms

In my post Querying Repeated Columns: Multiple IIFs, I discussed an example of a Patient table with repeated Yes/No columns that looked like this:

Figure 1: Patient table with repeated, Yes/No columns

Normalized to remove the repeated columns, the tables relationships would look like this:

Figure 2: Relationships for the Patient Symptom tables.

So how do I get the data from Figure 1 to Figure 2? Well, first of all, it depends on whether this is a one-time process where you are creating new normalized tables, or whether it is an on-going process where you are appending the data to existing, normalized tables.

In either case, however, not only do I have to split the records into separate tables, I also have to preserve the relationships between those records, so I have to do it is a specific order.


One-Time Process

Suppose I'm given a spreadsheet of data like this:



And I need to create a normalized database from it. I've already determined the tables I need (Figure 2 above), but if you're not certain what your table structure should be, you should read through my blog serie: What is Normalization, Entity-Relationship Diagramming, and The Normal Forms.

Link the spreadsheet into a database.

First of all, I need to get the spreadsheet into the database. I find it is preferable to link rather than import the file. Since I'm not going to be changing the data, there's no reason to import it. Once the file is linked it acts just like a table, and I can begin the process of normalizing it.

Step 1: Remove the common Patient elements.

Since I need to create a new table (called Patient) to hold these values, I'll use a Make-Table query:

SELECT Patient_RC.Patient INTO Patient
FROM Patient_RC
GROUP BY Patient_RC.Patient;


Or in the Query Builder:



The resulting table looks like this:



At this point, I need to add a Primary Key to the table. I prefer to use Surrogate Keys. (For a discussion of natural vs. surrogate keys, see my blog post: What is a Primary Key?)

Add surrogate key to PC table.

Adding a surrogate key to a table is as simple as opening it in Design View, adding an Autonumber Field, and making it the Primary Key.



Saving the table will automatically fill the autonumber field with values:



Step 2: Create Symptom_Temp table

As I showed in my post Querying Repeated Columns: Multiple IIFs it is necessary to combine multiple UNION queries with multiple IIF statements to create a list of Symptoms. Something like this:

SELECT Patient, IIf([cough],"Cough") AS Symptom
FROM Patient_RC WHERE [Cough] = True
UNION ALL
SELECT Patient, IIf([Sneeze],"Sneeze") AS Symptom
FROM Patient_RC WHERE [Sneeze] = True
UNION ALL
SELECT Patient, IIf([Fever],"Fever") AS Symptom
FROM Patient_RC WHERE [Fever] = True
UNION ALL
SELECT Patient, IIf([Body_Aches],"Body aches") AS Symptom
FROM Patient_RC WHERE [Body_Aches] = True
UNION ALL SELECT Patient, IIf([Nausea],"Nausea") AS Symptom
FROM Patient_RC WHERE [Nausea] = True
ORDER BY Patient, Symptom;


Note: You can't create or view this query in the Query Builder, however, you can create the first on (Cough) in the QB, then switch to the SQL View and copy and paste, modifying each for the specific column.

I still need to make this query into a Make-Table query. There are a couple of ways to do this. I can use a stacked query or all-in-one query.

Stacked Query

A stacked query is simply a query which uses another query in the FROM clause instead of a table. So if I save the above query as "qryTemp", I can use that query in the FROM clause of a Make-Table query:

SELECT Patient, Symptom INTO Symptom_temp
FROM qryTemp;

Or in the Query Builder:



All-in-one Query

It's also possible to it in a single query. To do that, I surround the SQL of qryTemp in parentheses and use it in the From clause of the Make-table. Like this:

SELECT Patient, Symptom INTO Symptom_temp
FROM (SELECT Patient, IIf([cough],"Cough") AS Symptom
FROM Patient_RC WHERE [Cough] = True
UNION ALL
SELECT Patient, IIf([Sneeze],"Sneeze") AS Symptom
FROM Patient_RC WHERE [Sneeze] = True
UNION ALL
SELECT Patient, IIf([Fever],"Fever") AS Symptom
FROM Patient_RC WHERE [Fever] = True
UNION ALL
SELECT Patient, IIf([Body_Aches],"Body aches") AS Symptom
FROM Patient_RC WHERE [Body_Aches] = True
UNION ALL
SELECT Patient, IIf([Nausea],"Nausea") AS Symptom
FROM Patient_RC WHERE [Nausea] = True
ORDER BY Patient, Symptom) AS Temp;


(Note: This cannot be done in the Query Builder)

The result (in either case) will be a table that looks like this:



Step 3: Remove Symptoms to Symptom Table

To fulfill 3NF, I need to create a query that removes the duplicate Symptom values into their own table, which I will call Symptom. The GROUP BY clause works well for that. As before, I'll use a Make-Table query to create a new table to hold the software values:

SELECT Symptom INTO Symptoms
FROM Symptom_temp
GROUP BY Symptom;


And as before, I'll add an Autonumber primary key, so the table looks like this:



Step 4: Creating the Linking Table: Patient_Symptoms

So now, I've got my two "One-Side" tables: Patient and Symptoms. Now I just need to create the linking table.

A "Linking table" is a mechanism by which Many-To-Many relationships are built in a relational database. (See: What is Normalizion Part V for more information.)

I can do that with a Make-Table Query and a join of Patient, Symptom_Temp, and Symptom:

SELECT CLng([Patient].[PatientID]) AS PatientID,
    CLng([Symptoms].[SymptomID]) AS SymptomID
INTO Patient_Symptoms
FROM (Patient INNER JOIN Symptom_temp
    ON Patient.Patient = Symptom_temp.Patient)
        INNER JOIN Symptoms
            ON Symptom_temp.Symptom = Symptoms.Symptom;


Or in the Query Builder:



Notice the expressions in the SQL Statement and the Query Builder: CLng([Patient].[PatientID]) and CLng([Symptoms].[SymptomID]). These are necessary because the Make Table query will attempt to create both PatientID and SymptomID as autonumber fields. Since a table cannot have two autonumber fields, this will give me an error.

Since they are both Foreign Keys, I don't want either to be autonumber. (A foreign key should NEVER be an autonumber.) So I use the CLng() function to convert them to the Long Integer datatype, which is the datatype which should ALWAYS be used to a foreign key to an autonumber primary key.

This will give me a table with two fields: PatientID and SymptomID. I'll also want to make these fields a compound primary key:



The final result looks like this:



The final step is to create the relationships between the three tables: Patient, Patient_Symptoms, and Symptoms.

Step 5: Create the Relationships

The easiest way to create relationships in Access is to use the Relationship Window. Add the four tables to the Relationships Window:



Click and drag PatientID from the Patient table to PC_ID in the Patient _Symptoms table. In the pop-up window, choose the Enforce Referential Integrity box:



And click Create. Do the same for SymptomID between Symptoms and Patient _Symptoms. The final result will look like this:





Next Time:
Next time, I’ll finish up by importing the spreadsheet to existing tables:
Normalizing Repeated Columns: Yes/No Fields (Part2)



No comments: