Friday, September 02, 2011

Primary keys from experience

From time to time I see articles in defense of natural primary keys against surrogate keys. I don't take an immovable stand on either side, as I have seen good cases for both. In general, I like the idea of a natural primary key, when I see one that it is really natural. Quite often, though, a natural primary key has proved itself faulty. And most of the times, when such faults happened, it was because of limited understanding of the data. I like to describe this kind of misinformation as data analysis culture clash. When choosing a natural primary key, one should consider which element, or group of elements, are unique in a given set. Not only that, they must be immutable, at least within that set. For example, in a group of people, we may assume that a combination of name, surname, date and place of birth is a good natural primary key. Well, no. It isn't, for several reasons. Read on for a few real cases. If we rule out the above combination, perhaps we could use the Social Security Number as a valid key? Not really. The fact is that the above assumptions work well if we consider only people from the same country in current times. If we extend our data definition to include people from different countries, or historical records, then the assumption collapses. A practical case: The birth place. This is a fair assumption. In combination with other elements (e.g.: date of birth, name, and surname) it can provide good basis for unique and immutable records. If you consider people in the United States today, you are right. Even in the UK, or Italy, or France. But try applying this method to places with recent political changes due to war or revolutions, and suddenly the name of the town may suddenly change. What was before VictoryBurg in Oppresslandia is now known as Heroes City in Freelandia. And this happens now, in the 21st century. If your set includes historical data, these occurrences may become frightfully frequent. Speaking of historical times, if you are dealing with really old records, you may want to consider that dates are less than immutable. The way we count days in most Western countries is called the Gregorian calendar, which was adopted by a handful of countries in 1582. Dates before October 5, 1582 and after October 14, 1582 are recorded using two different calendars. It is inconvenient, but if you know that you may avoid wrong date calculations. Simple, isn't it? Not really. If your records include data from different countries, you will have to take into account when this calendar was adopted. Just to give a few examples, England adopted the Gregorian calendar in 1752, Japan in 1873, Russia in 1918, Turkey in 1926. When dealing with people from different countries, you may be tempted to use citizenship as an immutable property. That may work, if you consider citizenship at birth and are prepared to keep the names of not currently existing states. If, instead, you get the citizenship from the employee's passport, you may incur in one or more of the following cases (all happened in practice during my work with an international organization):
  • Some employees came to work with the passport they had before the political change, and now that the country has split, they have different passports, depending on which side of the new border they live.
  • Some employees have parents from different countries and are entitled to more than one passport. They came to work initially with one passport, and came again a few year later with a different one because of different benefits.
  • Some employees started working with a given citizenship, and then they got a different one because of marriage, political asylum, or other legal means.
  • Some employees became stateless. The equivalent of NULL in a table field.
Coming to the Social Security Number, the assumption of uniqueness fails for the same reason that citizenship does. It's even more frequent, because of people working abroad and paying their taxes in a foreign country, so they had two or more SSN or equivalent to show. But this element failed for another reason, i.e. because it is not immutable. It is supposed to be, but in some cases it happens that, due to clerical mistakes, the SSN issued in a given country is wrong, and needs to be changed. It happens in Italy, where your SSN (called "codice fiscale", or fiscal code) must match your name, surname, place, and date of birth. If any of these elements was wrong when the code was generated, the code needs to be done again. It's a painful process that requires a court order, but the result is that the item is not immutable. Other elements that I have seen used wrongly for primary keys, either standalone or as key components, are: telephone numbers (they can change, and after being changed they can be assigned to other people), email addresses (they are unique, but they can easily change, or they can be abandoned when the person changes provider, or company, or both), gender (it can change), surname (it can change, legally, for male and female, depending on country and conditions). Summing up, a sane amount of skepticism should be used when considering if an element can be used in a primary key. Depending on the environment, the element can be safe or it can become a nightmare when the database grows from a neighborhood business to an international venture.

7 comments:

Roland Bouman said...

Hi Giuseppe,

interesting topic, with a lot of angles.

Except for few exceptions, I tend to choose autoincrementing integer surrogate keys, even for intersection tables (why I do that is the topic of another discussion, but I have a good reason). Of course I do always add unique constraints for non-primary key columns if the data implies it.

Anyway, I want to comment on immutability of primary keys.

There is to the best of my knowledge no logical requirement for a primary key to be immutable. It is at least not a requirement of the relational model, and provided the RDBMS supports foreign key constraints with cascading updates, then mutating the primary key will not affect data integrity. However, it's extremely convenient when can rely on the primary key to be immutable. I think the main advantage is that in the case of immutable primary keys, data exchange depends only on the current data set. If we have to take mutable primary keys into account, we can still have data exchange but only if we have a log of all changes so we can detect and handle primary key updates.

Anyway, my point is that even though most database designers take it for granted that primary keys are immutable, it is not a rule, and it's not impossible to have a consistent system that allows mutable primary keys. So if you have to work on some data exchange with an existing system, one cannot simply assume the primary keys are immutable, this is an assumption that should be verified, because it will affect the design of the data exchange enormously.

datacharmer said...

Roland,
Thanks for your comment.
Indeed, immutability is not a requirement of the relational model. It is, however, a practical requirement. Changing an existing primary key will often require a cascade update on other tables. This is bearable (especially if your database supports referential integrity and automatic cascade updates). But the net effect is that the change may invalidate official documents that have been printed using the then valid and now invalidated database contents.

rudy said...

"In general, I like the idea of a natural primary key, when I see one that it is really natural."

me too ;o)

to be fair, you should now write a second article, in which you analyze, to the same level of detail and and with as many nuances as in this article on problems with natural keys for persons (a most excellent article by the way), a case study of a natural primary key that is really natural

and i would very much like to see roland's article about the autoincrement for an intersection table

Roland Bouman said...

Hi Rudy,

"i would very much like to see roland's article about the autoincrement for an intersection table"

I'm not sure it would jusify a complete blog post of itself, but I'm happy to elaborate here.

When I'm just modeling, and not worrying about any particular implementation details (such as that some storage engines use clustered indexes and the choice of primary key has an effect on database performance), I work with these assumptions:

1) In many of cases, we know up front we won't be able to find a natural key.
2) In the cases where we don't know for sure, the time to do the research required to find a natural key is just not worth it.
3) If there happens to exist a natural key, nothing will be lost if we use a surrogate key, as long as we add the necessary unique constraint to prevent duplicates.

So it follows we will have surrogate keys not just incidentally, but a lot of times. This means our foreign keys will also usually point to single column surrogate keys.

Now we get to the intersection tables. If we accept #1 and #2, it seems we can know in advance that these objections do not hold for our intersection tables: we know that by definition, the combined foreign keys in the intersection table should be unique and not null, and because we ensure the primary keys of the referenced tables are surrogate keys, we already know that those keys will be stable. Thus the "natural" key (if you'd want to call a combination of two surrogate keys that) of the intersection table will be stable too.

However, I still prefer to give the intersection a (redundant) extra column to hold a surrogate primary key (but of course, a proper unique constraint will guard uniqueness of the combined parent keys). The reason is that whenever a table needs to reference the intersection table (and this always pops up when you don't have time to deal with schema changes), I would need to either design a multi-column foreign key, or refactor the table at that time to add the surrogate primary key. Choosing for the multi-column foreign key is not something I really like to do, for several reasons: first, it would be awkward and out of the ordinary since all other foreign keys just point to a single surrogate primary key; second, if that referencing table is itself an intersection table, we would now have a three or four column foreign key.

Now, I am well aware that there are no logical reasons to resent multi-column keys and foreign keys, so there is no logical reason for me to shun them. But when I kick in my initial assumption 3, "nothing is lost when you do use surrogate keys", then it suddenly seems silly to pedantically cling on to "natural" keys all togehter. On the contrary, by always using surrogate primary keys, and always having your foreign keys referencing those surrogate primary keys, the schema as whole becomes much more resilient to schema changes. In addition, queries and joins will always use the same, uniform pattern to relate table to one another, which makes it much more fool-proof to write queries against it.

So my reasons are not theoretical, it's just what I feel is more convenient and practical.

I hope this helps. Of course, I am very interested in hearing comments against my ways. I am always eager to learn from other people's experiences.

rudy said...

thanks for the explanation, roland, and it does make sense, sorta ;o)

but i am not convinced that an intersection table needs a surrogate

under what circumstances would an intersection table have a child table? (when it has a one-to-many relationship, he said, answering himself)

but there is no way you're going to add a row to this child table without knowing what the values of the two FKs in the intersection table are... i mean, how did you possibly navigate to the correct intersection row? surely not via its surrogate...

NERv said...

Another thing to consider when using timestamps (courtesy of http://googleblog.blogspot.com/2011/09/time-technology-and-leaping-seconds.html):

"[...] fluctuations in Earth’s rotational speed mean that even very accurate clocks, like the atomic clocks used by global timekeeping services, occasionally have to be adjusted slightly to bring them in line with “solar time.” There have been 24 such adjustments, called “leap seconds,” since they were introduced in 1972. Their effect on technology has become more and more profound as people come to rely on fast, accurate and reliable technology."

rob squire said...

Hi
What do you suppose the primary key (or indeed the unique key for roland) would be for the intersection table between people and addresses - especially if the intersection table also had timestamps (from and to) on it to hold the history of who lived where and when they lived there?
Regards
Rob

Vote on Planet MySQL