Tuesday, October 5, 2010

Normalizing City, State, and Zip


Recently, I ran into a question on the internet about normalization that I thought would be good to repeat.

THE QUESTION:

 I'm toying with the idea of starting a new project, so I'm in brainstorming mode for table design. I'll be recording customer information in this application. Typical stuff: First and Last Names, Company, Street, Apt, City State and Zip, Phone numbers(s) and extensions, E-mail.

How do you guys recommend setting up the tables for City State and Zip? I was thinking that I would have:

TBL_General_State
PKStateID
StateAbbr (Limited to 2 letters)
StateName


TBL_General_City
PKCityID
FKStateID (Lookup to TBL__State)
CityName


TBL_General_Zip
PKZipID
FKCityID (Lookup to TBL__City
ZipCode


My customer information then would record only the zip code (PKZipID). And I could then use queries for the state, city, and zip information for forms, reports, etc.

Or is this beyond overkill?

ANSWER:

 By strict normalization theory, having City, State, and Zip in the same table violates the 3rd Normal Form because there are functional dependencies between those fields. However, functional dependencies are not all the same. There are strong dependencies and weak dependencies.

A strong dependency is one in which the value of a dependent field MUST be changed if another field is changed. For instance, suppose I have Quantity, Price, and ExtendedPrice, where ExtendedPrice is a calculation of the other two. If I change either Quantity or Price, the ExtendedPrice MUST be changed.

A weak dependency is one in which the value of a dependent field MAY be changed if another field is changed. City, State, and Zip are examples of weak dependencies. If I change a person's city, I may not have to change their state. They may have moved within the same state. Likewise, if I change the state, I may not have to change the city. There is, after all, a Grand Rapids, Michigan and Grand Rapids, Minnesota. The relationship between city and zip is even more complicated.

Now, it is possible to represent these fields in a fully normalized fashion, but I contend that it is more trouble for very little gain. There are two main reasons for normalizing data: minimize redundant data and maximize data integrity. Both of these can be achieved by using lookup tables for City and State without trying to represent the relationship between the two. A zip code could be mis-typed, of course, but it could also be mis-selected from a list, so to my mind there's no real reason to have a lookup table.

If you did normalize these fields, you could have a selection process that would present all possible combinations of values if you selected the City. For instance, if you had a combo box for City, you could have cascading combo boxes to select only the appropriate States and Zip codes. But it would be just as easy to mis-select the right value from this list as it would be to mis-select from independent lookup tables. And, of course, you'd have to create and maintain these relationships.

Therefore, normalizing City, State, and Zip adds a complication to your data model for very little gain, and in my opinion, is a good example of when to denormalize.

.

3 comments:

AsherO said...

Keep in mind that some zip codes have multiple city names associated with them (usually in rural areas where a zip code covers a few towns). In such instances, by storing just the zip code, you're potentially risking having the wrong city for the zip code.

Roger Carlson said...

AsherO,

While that is true, I don't think the risk is any greater than if you select city and zip independently. In addition, the "cost" of maintaining the relationship is high.

As I said, it's more trouble than it's worth.

Anonymous said...

Sorry about showing up so late to the party, but zip is even *more* complicated than that if you want to cover all the edge cases: zip codes can span state boundaries, as well, so a city-state lookup may not properly cover rural areas where the nearest post office is in a different state than the address served. This just makes it even more difficult to remove zip code from the database.