Some Oracle Papers

Imported from Google docs

Oracle in-list bind

Posted by pachot on July 18, 2008

Oracle and JDBC
Array binding for a statement with IN list
version 1.0

Franck Pachot
contact@pachot.net

http://docs.google.com/Doc?id=ddbk8jd6_385crzqtdgq

Introduction

Usage of bind variable instead of literals is a key to scalability.
However, we have no mean to use bind variables for a statement like the following:

SELECT ename,empno FROM emp where empno in (7369,7566,7782);

Especially when we have a variable number of elements in the list.

This document will show how to use an oracle collection as a bind variable, whith an example in Java.

Description

Nested Table

Oracle has several 2 types of collections that can be used in a SQL statement: nested tables and VARRAYS.
Here we will use a nested table to have a one dimension collection of unbounded homogeneous elements. This is a database object created with CREATE TYPE.

As we want a list of integers, we will use:
CREATE OR REPLACE TYPE NUMBER_LIST_TYPE AS TABLE OF number;

To use that collection, our statement is a bit different from a IN ( value, value , … ):
We will need to

  1. cast the array that comes from jdbc to our collection type
  2. transform our collection to a sql table
  3. change our IN(list) to a IN(subquery)

So we transform the following query

SELECT ename,empno FROM emp where empno in (7369,7566,7782);

in:

SELECT ename,empno FROM emp
WHERE empno IN (
SELECT * FROM TABLE( CAST ( ? as NUMBER_LIST_TYPE ) )
)

JDBC array

To be able to transform a java array ( such as Int[] in our example ) into a bind variable that can be used as our nested table type, we will:

  1. get a descriptor for our nested table: oracle.sql.ArrayDescriptor. It is created once (as the statement is parsed once).
  2. create an array that is a jdbc datatype: oracle.sql ARRAY. It is created with the descriptor, and the java array.
  3. we will the use PreparedStatement.setObject() to bind our variable for each execution.

Example

We suppose that the EMP table is already created
We have one sql file for the DDL that creates our collection type:

Test.sql:

CREATE OR REPLACE TYPE NUMBER_LIST_TYPE AS TABLE OF number;
/

And our java source code that

  1. connects to the database
  2. parse the statement and describe the collection
  3. executes 2 times the statement with a different list.

Test.java:

import java.sql.*;
import oracle.sql.ArrayDescriptor;
import oracle.sql.ARRAY;
import oracle.jdbc.*;

/*
The following DDL must be run to create the oracle collection:
CREATE OR REPLACE TYPE NUMBER_LIST_TYPE AS TABLE OF number;
*/

public class Test{

static public void main(String[] args) throws SQLException {

/* CONNECTION ******************************************************************************* */

DriverManager.registerDriver(new oracle.jdbc.driver.OracleDriver());
Connection conn = DriverManager.getConnection(“jdbc:oracle:thin:@host:port:sid”,”SCOTT”,”TIGER”);

/* PARSING ********************************************************************************** */

/* The oracle collection is described */
ArrayDescriptor oracleCollection = ArrayDescriptor.createDescriptor(”NUMBER_LIST_TYPE“,conn);

/*
The sql statement uses IN ( subselect ) syntax.
the input parameter is cast to our table type (NUMBER_LIST_TYPE)
then transformed as a table so we can select on it
*/
PreparedStatement stmt = conn.prepareStatement(
SELECT ename,empno FROM emp
+” WHERE empno IN (
+” SELECT * FROM TABLE( CAST ( ? as NUMBER_LIST_TYPE ) )
+” ) “);

/* EXECUTION 1 ****************************************************************************** */

System.out.println(”1st execution:”);

/* define our java array */
int[] javaArray1 = { 7369,7566,7782 };

/* define our oracle array */
ARRAY jdbcArray1 = new ARRAY ( oracleCollection , conn , javaArray1 ) ;

/* bind that array to our statement bind variable */
stmt.setObject(1,jdbcArray1);

/* execute the query and browse the result */
ResultSet r=stmt.executeQuery(); while(r.next()){ System.out.println(”t”+”t”+r.getString(2)+”: “+r.getString(1)); }

/* EXECUTION 2 ****************************************************************************** */

System.out.println(”2st execution:”);

int[] javaArray2 = { 7900,7902 };
ARRAY jdbcArray2 = new ARRAY ( oracleCollection , conn , javaArray2 ) ;
//stmt.setObject(1,jdbcArray2,OracleTypes.ARRAY);
stmt.setObject(1,jdbcArray2);

/* execute the query and browse the result */
r=stmt.executeQuery(); while(r.next()){ System.out.println(”t”+”t”+r.getString(2)+”: “+r.getString(1)); }

/* END */

stmt.close();
conn.close();

}
}

The output is:

1st execution:
7369: SMITH
7566: JONES
7782: CLARK
2st execution:
7900: JAMES
7902: FORD

the statement has been parsed only once, and can be executed several times with a different array of values.

Remarks

  • We used ‘SELECT * FROM TABLE()’. The name of the column returned by a nested table type is: COLUMN_VALUE
  • We can use an array of character strings (String[]). The collection will be a TABLE OF VARCHAR2(…)
    Note that I had a problem when not having nls_charset12.jar in the CLASSPATH (all strings where null)
  • That is an introduction of passing arrays, structures objects between java and a SQL statement, or a stored procedure.

References

All tests and documentation that helped building this document were related to 10gR2 Oracle version.
All comments are welcome at
contact@pachot.net

Posted in Uncategorized | Tagged: | Leave a Comment »

Recover a deleted datafile from unix when database is still up

Posted by pachot on April 8, 2008

Recover a deleted datafile from
unix/linux when database is still up

version 1.0

Franck Pachot
contact@pachot.net
http://docs.google.com/Doc?id=ddbk8jd6_37gs3h694p

On Unix/Linux, when a file is deleted, but a process still has the file open, the file is still there in the filesystem, and only the inode is removed.
But the process can continue to use its file handle, and the file can also be accessible under /proc//fd .

In the following example, we use that behaviour to recover a lost datafile after is has been dropped from the os (with rm) but the datafile is still open by the background processes.

First, we create a tablespace, and populate a table in it.

SQL> REM we create a tablespace:
SQL> create tablespace TEST_RM datafile ‘/var/tmp/test_rm.dbf’ size 10M;
Tablespace created.

SQL> REM we create a table in it:
SQL> create table FRANCK tablespace test_rm as select * from dba_objects;
Table created.

SQL> REM we check that table data is accessible:
SQL> select count(*) from FRANCK;
COUNT(*)
———-
12708

SQL> exit
Disconnected from Oracle Database 10g Express Edition Release 10.2.0.1.0 – Production

Then, we drop the datafile from unix prompt.
here is the datafile
ls -l /var/tmp/test_rm.dbf
-rw-r—– 1 oracle dba 10493952 Mar 26 14:25 /var/tmp/test_rm.dbf

we ‘accidently’ drop the datafile
rm /var/tmp/test_rm.dbf
ls -l /var/tmp/test_rm.dbf
ls: /var/tmp/test_rm.dbf: no such file or directory


Here the datafile is lost.
Now we connect again.

sqlplus / as sysdba

Connected to:
Oracle Database 10g Express Edition Release 10.2.0.1.0 – Production

SQL> REM and we check if table data is accessible:
SQL> select count(*) from FRANCK;

select * from franck
*
ERROR at line 1:
ORA-01116: error in opening database file 5
ORA-01110: data file 5: ‘/var/tmp/test_rm.dbf’
ORA-27041: unable to open file
Linux Error: 2: No such file or directory
Additional information: 3

The datafile is lost and data is not accessible.

However, the datafile should still have an open file descriptor by an oracle background process

we check the dbwriter pid:
ps -edf | grep dbw
oracle 2661 1 0 Mar25 ? 00:00:06 xe_dbw0_XE
oracle 7044 7037 0 14:40 pts/1 00:00:00 /bin/bash -c ps -edf | grep dbw
oracle 7046 7044 0 14:40 pts/1 00:00:00 grep dbw

and we check its opened file descriptors for our file:
ls -l /proc/2661/fd | grep test_rm
lrwx—— 1 oracle dba 64 Mar 26 14:02 66 -> /var/tmp/test_rm.dbf (deleted)

here it is:
ls -l /proc/2661/fd/66
lrwx—— 1 oracle dba 64 Mar 26 14:02 /proc/2661/fd/66 -> /var/tmp/test_rm.dbf (deleted)

In some other unix, lsof may be needed to map the file descriptor with the deleted file name

first we set a symbolic link so that oracle can see it as it was before the delete:
ln -s /proc/2661/fd/66 /var/tmp/test_rm.dbf

here data is accessible, but that will be lost if dbwriter closes it file handle (i.e if the database is closed)

However we can now set the tablespace read only so that it is checkpointed, and no writes occurs on it.

SQL> alter tablespace TEST_RM read only;
Tablespace altered.

We can now copy the file safely.

then we drop the symbolic link:
rm /var/tmp/test_rm.dbf
ls -l /var/tmp/test_rm.dbf
ls: /var/tmp/test_rm.dbf: No such file or directory
and we can now copy the file
cp -p /proc/2661/fd/66 /var/tmp/test_rm.dbf
ls -l /var/tmp/test_rm.dbf
-rw-r—– 1 oracle dba 10493952 Mar 26 14:54 /var/tmp/test_rm.dbf

And datafile is now available again.

SQL> REM we have it back, lets put the tablespace back in read/write
SQL> alter tablespace test_rm read write;
Tablespace altered.

SQL> REM and we check data is still there:
SQL> select count(*) from FRANCK;

COUNT(*)
———-
12708

This is not to be used like that in production. This is unsupported and may behave differently on different unix/linux or oracle versions.

All tests and documentation that helped building this document were related to 10gR2 Oracle and Linux 2.6 versions .
All comments are welcome at
contact@pachot.net

Posted in backup/recovery | Tagged: , , | Leave a Comment »

Oracle Begin Backup Description

Posted by pachot on April 8, 2008

Oracle Backup and Recovery
Description of Begin backup/End backup
version 1.1

Franck Pachot
contact@pachot.net

http://docs.google.com/Doc?id=ddbk8jd6_37gs3h694p

Introduction

This document tries to explain exactly what happens when using
ALTER TABLESPACE … BEGIN BACKUP and ALTER TABLESPACE … END BACKUP,
and why it is mandatory to use it when the online backup is done with a tool that is external to Oracle ( such as OS backups using cp, tar, BCV, etc. )

It also gives an answer to those frequent questions:

  • Does Oracle write to data files while in hot backup mode ?
  • What about ALTER DATABASE BEGIN BACKUP ?
  • Why it is not used with RMAN backups
  • What if you do an online backup without setting tablespaces in backup mode ?
  • What if the instance crashes while the tablespaces is in backup mode ?
  • How to check which datafiles are in backup mode
  • What are the minimal archive logs to keep with the hot backup ?
  • Why use OS backups instead of RMAN ?
  • Why BEGIN BACKUP takes a long time ?

Description


Offline backup (Cold backup)

A cold OS backup is simple: the database has been cleanly shut down (not crashed, not shutdown abort) so that:

  • all datafiles are consistent (same SCN) and no redo is needed in case of restore
  • the datafiles are closed: they will not be updated during the copy operation

Thus, it can be restored entirely and the database can be opened without the need to recover.

Online backup (Hot backup)

An hot backup does the copy while the database is running. That means that the copy is inconsistent and will need redo applied to be usable.
Recovery is the process of applying redo log information in order to roll-forward file modifications as they were done in the original files.

When the copy is done with Oracle (RMAN), Oracle copies the datafile blocks to backupset so that it will be able to restore them and recover them.

When the copy is done from the OS (i.e with a tool that is not aware of the Oracle file structure), several issues come up:

  • Header inconsistency: Nothing guaranties the order in which the files are copied, thus the header of the file may reflect its state at the beginning or at the end of the copy.
  • Fractured blocks: Nothing guaranties that an Oracle block is read in one single i/o so that two halves of a block may reflect its state at two different points in time.
  • Backup consistency:As the copy is running while the datafile is updated, it reads blocks at different point in time. The recovery is able to roll forward blocks from the past, but cannot deal with blocks from the future, thus the recovery of the copy must be recovered at least up to the SCN that was at the end of the copy.

So it is all about consistency in the copy: consistency between datafiles, consistency within datafiles and consistency within data blocks, and keep this consistency in the current files (obviously) as well as in the copy (as it will be needed for a restore/recovery)


Backup mode

The goal of ALTER TABLESPACE … BEGIN BACKUP and ALTER TABLESPACE … END BACKUP is to set special actions in the current database files in order to make their copy usable, without affecting the current operations.

Nothing needs to be changed in the current datafiles, but, as the copy is done by an external tool, the only way to have something set in the copy is to do it in the current datafiles before the copy, and revert it back at the end.

This is all about having a copy that can be recovered, with no control on the program that does the copy, and with the minimal impact on the current database.

In order to deal with the 3 previous issues, the instance that will do the recovery of the restored datafiles has to know:

  • that the files need recovery
  • from which SCN, and up to which SCN it has to be recovered at least
  • enough information to fix fractured blocks

During backup mode, for each datafile in the tablespace, here is what happens:

1- When BEGIN BACKUP is issued:

  • The hot backup flag in the datafile headers is set, so that the copy is identified to be a hot backup copy.
    This is to manage the backup consistency issue when the copy will be used for a recovery.
  • A checkpoint is done for the tablespace, so that in case of recovery, no redo generated before that point will be applied.
    Begin backup command completes only when checkpoint is done.

2- During backup mode:

  • The datafile header is frozen so that whenever it is copied, it reflects the checkpoint SCN that was at the beginning of the backup.
    Then, when the copy will be restored, Oracle knows that it needs to start recovery at that SCN to apply the archived redo logs.
    This is to avoid the header inconsistency issue.
    That means that any further checkpoints do not update the datafile header SCN (but they do update a ‘backup’ SCN)
  • Each first modification to a block in buffer cache will write the full block into the redo thread (in addition to the default behaviour that writes only the change vector).
    This is to avoid the fractured block issue. There may be a fractured block in the copy, but it will be overwritten during the recovery with the full block image.

That means that everything goes as normal except for two operations:
– at checkpoint the datafile header SCN is not updated
– when updating a block, the first time it is updated since it came in the buffer cache, the whole before image of the block is recorded in redo
– direct path writes do not go through the buffer cache, but they always write full blocks and then full block is written to redo log (if not in nologging)

3- When END BACKUP is issued:

  • A record that marks the end of backup is written to the redo thread so that if the copy is restored and recovered, it cannot be recovered earlier than that point. This is to avoid the backup consistency issue.
  • The hot backup flag in the datafile headers is unset.
  • The header SCN is written with the current one.

Remarks:

  1. the fractured block is not frequent as it happens only if the i/o for the copy is done at the same time on the same block as the i/o for the update. But the only mean to avoid the problem is to do that full logging of block for each block, just in case.
  2. if the OS i/o size is multiple of the Oracle block size (e.g backup done with dd bs=1M), that supplemental logging is not useful as fractured blocks cannot happen.
  3. the begin backup checkpoint is mandatory to manage the fractured block issue: as Oracle writes the whole before image of the block, it needs to ensure that it does not overwrite a change done previously. With the checkpoint at the beginning, it is sure that no change vector preceding the begin backup has to be applied be applied.
  4. The supplemental logging occurs when accessing the block for the first time in the buffer cache. If the same block is reloaded again in the buffer cache, supplemental logging will occur again. I haven’t seen that point documented, but a testcase doing a ‘flush buffer_cache’ proves that.


Consequence on the copy (the backup)

When the copy has been done between begin backup and end backup, the copy is fully available to be restored and recovered using the archive log files that where generated since the begin backup.
After the files have been restored, Oracle sees that the SCN is older than the current one and says that the database needs recovery.

The recovery must be done up to a point in time ulterior to the end backup point in time so that we are sure that there is no blocks in the datafile that comes from the future.

Consequence on the current database

All operations can be done during the backup mode.
However, as more logging is written, it should be done during a low activity period. And for the same reason, it is better to do the tablespaces one after one instead of putting all the database tablespaces in backup mode.

In addition, it is not possible to shutdown the database while a tablespace is in hot backup.
This is because, as the datafile header is frozen with a non current SCN, the datafile would be seen as if it requires recovery.

However that cannot be avoided if the instance crashes (or shutdown abort), and then the startup of the database will raise:
ORA-1113: file … needs media recovery
This is the only case I know where instance recovery is not automatic, you need to issue ‘alter database… end backup;‘ before opening the database.


Frequent questions


Does Oracle write to data files while in hot backup mode ?

Yes of course, it would not be called ‘online’ backup if it were not the case.

What about ALTER DATABASE BEGIN BACKUP ?

That command put all database tablespaces in backup mode at the same time.
As seen previously, it is a bad idea to put all tablespaces in backup mode, as it is better to do it one by one in order to minimize the supplemental redo logging overhead.

Oracle introduces this ’shortcut’ for one reason only: when doing backup with a mirror split (BCV, Flashcopy, etc), the copy gets all the datafiles at the same time, and the copy lasts only few seconds. In that case, it is easier to use that command to put all tablespaces in backup mode during the operation.

Why it is not used with RMAN backups

RMAN is an Oracle tool, that is totally aware of the datafile structure, and the way they are written.
Then, it knows how it can read the datafiles in a way the copy is consistent: write the good version of datafile header, read the blocks with an i/o size that is multiple of the Oracle block size so that there is no fractured blocks, and check head and tail of the block to see if block is fractured (in that case, it re-reads the block to get a consistent image).

That is one advantage among many others of using RMAN for backups.

What if you do an online backup without setting tablespaces in backup mode ?

If you don’t put the tablespace in backup mode, we can’t be sure that the copy is recoverable. It may be fine, but it may have inconsistencies.

We can suppose that the copy is consistent if we make the copy with the following conditions

  • Header inconsistency: If the file copy is done from beginning to end, then the datafile header should reflect the right SCN
  • Fractured blocks: If the copy does i/o with a size that is multiple of the Oracle block size, then you should not have fractured blocks
  • Backup consistency:If you take care to recover later than the point in time of the end of the copy, you should not have inconsistency

But there may be other internal mechanisms that are not documented so that we can’t be sure that this list of issues is exhaustive.
And, as it is not supported, we cannot rely on a backup done like that. Note that you will have no message

What if the instance crashes while the tablespaces is in backup mode ?

When you start the database after that, Oracle will say that it requires recovery. This is because the SCN was frozen and it is the expected behaviour because if you restore the copied file, it has to be recovered. (and the only way Oracle has to set its value in the copy is to set it in the current file while it is copied)

In that case you can can issue:

ALTER DATABASE END BACKUP;ALTER DATABASE OPEN;

to open the database.

But your backup is not usable, you have to do it again.

How to check which datafiles are in backup mode

The V$BACKUP view shows the datafiles that are currently in backup mode (status ACTIVE).

Some old documentation says to check V$DATAFILE_HEADER column FUZZY.

This is because in previous versions (<9i) the begin backup unsets the online fuzzy bit in the datafile header, and set it back at when end backup is issued.

Since 9i, the online fuzzy bit is unset only when datafile is offline or read-only, not for backup mode.

What are the minimal archive logs to keep with the hot backup ?

The backup done online is unusable if there is not at least the possibility to restore archive logs
– from the archive log that was the current redo log when the backup started,
– up to the archive log that was archived just after the backup (of the whole database) ended.

That is sufficient to do an incomplete media recovery up to the point of ‘end backup’.
Subsequent archive logs will be needed to recover up to the point of failure.


Why use OS backups instead of RMAN

The best way to do online backups is using RMAN as it has a tons of features that you cannot have with OS backups.

Yet, the OS backup are still used when using OS tools that can copy an entire database in seconds, using mirror split (BCV, Flashcopy, etc), for very large databases.


Why BEGIN BACKUP takes a long time ?

BEGIN BACKUP has to checkpoint the dirty buffers related with the tablespace, so that the backup datafiles and the folowing redo log are sufficient to recover.
This is the same reason why an offline backup is done after a normal shutdown (that do a checkpoint) and not a shutdown abort, but here it is at tablespace level only.
The duration of that checkpoint is proportional to the buffer cache size and the number of datafiles.
The performance of BEGIN BACKUP checkpoint has been improved in 10g.

References


All tests and documentation that helped building this document were related to 10gR2 Oracle version.
All comments are welcome at
contact@pachot.net

Posted in Uncategorized | Tagged: | Leave a Comment »

Oracle surrogate key

Posted by pachot on March 26, 2007

Oracle Data Modeling
Choosing natural key or surrogate key
version 0.8

Franck Pachot
contact@pachot.net
See last version here: http://docs.google.com/Doc?id=ddbk8jd6_403htj9p85p

Introduction

Choosing the primary key is an important data modeling step. Choosing between the two alternatives (natural or surrogate key) is something often discussed between experts.
There are several reasons to choose for one or for the other.
Some people encourage to choose one alternative for the whole data model, such as ‘always use natural keys’ or ‘always introduce a surrogate key’ and discussions about that often tend to be ‘religious’.

In this paper, I will study all the pros and cons that I’ve seen given as a reason for the choice. I will take each one, give an example on a simple case, try to conclude for a ‘rule’ that helps to decide, and point to the consequences.

Because the reader may not want to read one more paper on natural vs. surrogate key, I’ll first summarize my conclusions:
- there is no general rule, neither ‘always use natural keys’, nor ‘always introduce surrogate keys’
- the choice must be done on a case by case basis: it depends on the table, the volume, the use case, the performance criteria, etc.
- the choice may be straightforward, or can necessitate a compromise (usually between performance and maintainability)
- when the choice is done, several Oracle features can help to limit the drawbacks of the chosen alternative.
- the difficulty to identify a natural key may come from an incomplete functional analysis, and a surrogate key sometimes hides a business concept that was forgotten.

We will analyze each of the arguments on several cases related with a simple data model, showing the pros and cons of the 2 alternatives, and the conclusion will help to understand the advantages or disadvantages, in order to do the right choice in front of a specific situation.


Definitions

From the functional specification to the physical design.

The functional analysis describes the objects that are manipulated by the business: Business Objects – in OOA (Object Oriented Analysis) – or Entities .

They are modeled as a Computer Independent Model: the Domain Model – in OOA (Object Oriented Analysis) – or the Conceptual Data Model.

The design will transform those specification to an IT specification, the Platform Independent Model: the Design Class Diagram – in OOD (Object Oriented Design) - or the Logical Data Model.

The implementation of that model within a relational database transforms it to the Platform Specific Model: the Physical Data Model (and the Object-Relational Mapping in OOD)

Business Entities (Business Objects or Business Concepts)

The business object specification gives a definition of the objects, their attributes, their relationsips.
The definition may include the attributes that identify an instance of the business concept, and those identifiers are usually be part of the business object definition.

Logical model

Those business concepts become entities in the logical model (or UML Classes in design model).
The identifying attributes become the candidate key(s).

Physical model

When implementing that model in relational database, we need to define:
- one primary key per table (a table implements entities and relationships)
- alternate keys to enforce uniqueness for other candidate keys
- foreign keys referencing a primary key to enforce referential integrity

Primary key

The chosen primary key can be either:

– one of the candidate keys (coming from the identifying attributes).

It is often called ‘natural key‘ but we can also encounter the terms:

‘business key’ as it is composed of business attributes
‘intelligent key’,'meaningful key’ as it has a meaning for end-users

– or one attribute created at design, generated at insert time, for each row, if we don’t want to use a candidate key to identify rows and enforce referential integrity.

It is often called ‘surrogate key‘ because it replaces a natural key, but we can also encounter the terms:
‘technical key’,’system-generated key’ as it is introduced by the implementation
‘non-intelligent key’,'meaningless key’ as it has no meaning for the end-user
‘internal key’ as it has no meaning outside the system
’synthetic key’,'artificial key’ as opposed to natural key

A natural key is a business concept that is used technically to implement referential integrity in relational databases.
It is unique, always defined, and immutable, all that within the scope of the business domain (must be the same for the same object across several databases or applications) .

A surrogate key is a technical value only, that have no meaning to business, and is not visible to them.
It is unique, always defined, and
immutable, all that within the scope of the  IT system (cannot be used across several databases or applications).

Tables that implement the relationships (many-to-may) includes the primary keys from both tables, and the whole is a composite primary key for the association table.


Example (sudy case)

A simple model example

First, lets introduce a simple data model, having two entities, where the application manages the internal phone calls within a company. The business has identified 2 objects (entities):

Phone lines:

Each phone line is identified by its number, that is composed with a prefix number (depending on the company site), and and extension number.
the phone line is assigned to one user

Calls:
A call is identified by the originating phone line and the timestamp when the call starts.
The destination
phone line and the timestamp when the call starts also identifies a call uniquely.

We record the call duration

Here is the UML domain model:

We use the <> stereotype to show which attributes or association identifies an instance of the class.
It is known as an ‘identifying relationship’ as in a relational model either an attribute or a relationship can be part of the key.
From that specification, we will study 2 possible physical data models: one with only natural keys, and one with surrogate keys for all entities
Our naming convention prefixes the foreign keys with the name of the association role
.
NATURAL KEY MODEL:
SURROGATE KEY MODEL:
Here is the DDL to create the physical data models:

/* NATURAL KEY MODEL */
create table PHONELINES (
PREFIX_NO varchar2(2),
EXTENSION_NO varchar2(4),
USER_NAME varchar2(60),
primary key (PREFIX_NO,EXTENSION_NO)
);
create table CALLS (
PHONELINE_PREFIX_NO  varchar2(2) ,
PHONELINE_EXTENSION_NO varchar2(4) ,
CALL_START_DATETIME date,
PARTNER_PREFIX_NO varchar2(2) ,
PARTNER_EXTENSION_NO varchar2(4) ,
DURATION number,
primary key(PHONELINE_PREFIX_NO,PHONELINE_EXTENSION_NO,CALL_START_DATETIME),
unique(PARTNER_PREFIX_NO,PARTNER_EXTENSION_NO,CALL_START_DATETIME),
foreign key(PHONELINE_PREFIX_NO,PHONELINE_EXTENSION_NO) references PHONELINES,
foreign key(PARTNER_PREFIX_NO,PARTNER_EXTENSION_NO) references PHONELINES
)
/* SURROGATE KEY MODEL */
create sequence PHONELINE_SEQUENCE;
create table PHONELINES
PHONELINE_ID number primary key,
PREFIX_NO varchar2(2),
EXTENSION_NO varchar2(4),
USER_NAME varchar2(60),
unique(PREFIX_NO,EXTENSION_NO)
);
create sequence CALLS_SEQUENCE;
create table CALLS
CALL_ID number primary key,
PHONELINE_ID number references PHONELINES(PHONELINE_ID),
PARTNER_PHONELINE_ID number references PHONELINES(PHONELINE_ID),
CALL_START_DATETIME date,
DURATION number,
unique(PHONELINE_ID,CALL_START_DATETIME),
unique(PARTNER_PHONELINE_ID ,CALL_START_DATETIME)
);

Consequences on natural or surrogate key choice for different use cases

Base on that example, we will study the consequences of the natural/surrogate key models for different points of view, and different use cases

1- OLTP queries efficiency:

REQUIREMENT:

We need to quickly find the call history for a phone line for example:’Find calls from phone 31131

NATURAL KEY MODEL:

The query will be:
select * from CALLS where PHONELINE_PREFIX_NO=99 and PHONELINE_EXTENSION_NO=1234

We only access to the calls table, without any joins, because all information is in the call table.

SURROGATE KEY MODEL:

The query will be:
select s.PREFIX_NO , s.EXTENSION_NO , c.* from CALLS c , PHONELINES p
where c.PHONELINE_ID=p.ID and p.PREFIX_NO=99 and p.EXTENSION_NO=1234

We need a join here: the query is more complex, execution is more expensive

CONCLUSIONS:

When using natural keys, as they have a business meaning, the parent key information is available within the child table.
Using surrogate key requires one more join: sql queries are more complex, and less efficient.

=> If not needed for other reasons, the rule is to avoid unnecessary surrogate keys here.
=> If a surrogate key is however chosen, we need to tune the join operation (indexing the foreign key for example).

2- Business key updates:

REQUIREMENT:

User SCOTT is moving to another company site. His old phone number was 99-1234. His new phone number is 11-6789.
There is two possible scenarios here: old calls from previous number are kept (scenario A-) , or they must be reassigned to the new phone number (scenario B-)

NATURAL KEY MODEL:

A- If business wants to keep the call history assigned to the old phone number, then there is no need to cascade the update:

update PHONELINES set USER_NAME=null where PREFIX_NO=99 and EXTENSION_NO=1234;
insert into PHONELINES (PREFIX_NO,EXTENSION_NO,USER_NAME) = (’11′,’6789′,’SCOTT’) ;
commit;

There is no foreign key lock issues here because the CALLS foreign key is indexed: as it is part of the primary key, then the foreign key is the first column of the index.
If it were not the case (non-indexed foreign key) then the delete will wait until there is no concurrent insert in CALLS – even if they don’t concern the same phone number.
So in that case the modification cannot be considered as an OLTP operation and may have to be planned at offline hours to avoid locks.

B- Now, if business wants to move the call history to assign it to the new phone number – as if calls were originated by this new number – then we really have a mutable key problem.
We need to cascade the update to all calls originated by the old number and Oracle (as well as most other RDBMS) does not have efficient ways to do that.

insert into PHONELINES (PREFIX_NO,EXTENSION_NO,USER_NAME) = (’11′,’6789′,’SCOTT’) ;
update CALLS set PHONELINE_PREFIX_NO=11 and PHONELINE_EXTENSION_NO=6789 where PHONELINE_PREFIX_NO=99 and PHONELINE_EXTENSION_NO=1234;
commit;

That is a big query that may have performance or concurrency issue if run online. It should be planned as an offline operation.
It may even have to be considered as a maintenance operation, such as when the phone prefix plan changes do to company reorganization.

SURROGATE KEY MODEL:

A- If business wants to keep the call history assigned to the old phone number, then the new number is seen as a new phone line:

update PHONELINES set USER_NAME=null where PREFIX_NO=99 and EXTENSION_NO=1234;
insert into PHONELINES (PHONELINE_ID,PREFIX_NO,EXTENSION_NO,USER_NAME) values (PHONELINE_SEQUENCE.NEXTVAL,’11′,’6789′,’SCOTT’) ;
commit;

B- If business want to move the call history to assign it to the new phone number, then the phone line do not change and only its phone number is changed:

update PHONELINES set PREFIX_NO=11 ,EXTENSION_NO=6789 where USER_NAME=’SCOTT’;
commit;

Both operations are quick as they do not update the primary key (PHONELINE_ID). They can be normal OLTP operations.

CONCLUSIONS:

Surrogate keys are necessary when the business key is not static enough (static as the with the ‘{frozen}’ UML property)

But, if the updates to the business key are rare and can be considered as an offline batch operation, or as a maintenance operation, then business key can still be chosen as the primary key.

However, we have seen 2 scenarios here, and we need more information from business to make the choice.

The fact that we cannot identify a static natural key may point to the fact that we have done an incomplete functional analysis.
In our example, we can imagine that we must distinguish the phone number from the physical phone line, and have an association table that records the assignation of a phone number to a phone line. If we introduce a surrogate key here, we hide the fact that the surrogate key is in reality a new business entity – the phone line – distinct from the phone number.

The risk here is that one day business needs to manipulate that entity (as it has a business meaning) and then it is not a surrogate key anymore.

=> The rule is to introduce a surrogate key when the natural key can change during online activity.
=> However, it is a good practice to check first if that surrogate key does not hide a business concept that was forgotten during functional analysis.

3- Candidate key values are unknown at object creation (at insert):

REQUIREMENT:

We need to create a phone line, and have calls from it, even before a phone number is assigned to it.

NATURAL KEY MODEL:

Primary key cannot be null. So we must assign a dummy value (that is not a real natural key as it has no business meaning), and then we will have to update it and cascade the update to the calls that have already be recorded. A primary key that is not know at insert time is not a good primary key.

SURROGATE KEY MODEL:

There is no problem here when using a surrogate key. It is always assigned at insert time (usually with a sequence).
We can even create a row with only the surrogate key and update attributes later. But in that case, we cannot use not null integrity constraints on the other columns.

CONCLUSIONS:

Surrogate keys are needed when we don’t have a natural key assigned at the creation of the row.
However, that may reveal an incomplete functional analysis.
In our example, it seems again that there is an ambiguity between phone numbers and phone lines, and we probably need 2 entities (tables) for that.
Phone number is not an attribute of the phone line in that use case. It should be another entity that is associated to a phone line.

=> Surrogate keys must be introduced when natural key is not known at insert.
=> However, it is a good practice to verify first if that surrogate key does not hide a business concept that was forgotten during functional analysis.

4- Partitioning possibilities:

REQUIREMENT:

We need scalability when number of customer increases, and we need to regularly purge old orders.

NATURAL KEY MODEL:

Table CALLS can be partitioned by range of CALL_START_DATETIME so the performance of queries on recent calls is not dependent on the history retention.
All indexes can be local so that purging old calls is a quick operation (truncate partition) without maintaining/rebuilding indexes.

Table CALLS can also be partitioned (or subpartitionned) by hash or list on PHONELINE_PREFIX_NO so that filtering calls from a specific prefix is quick.

SURROGATE KEY MODEL:

Table CALLS cannot be partitioned on PHONELINE_PREFIX_NO. It can be partitioned on PHONELINE_ID but that does not help queries on PREFIX_NO

Table CALLS can be partitioned by range of CALL_START_DATETIME but then the unique index that enforces the primary key (CALL_ID) cannot be local.
Purging old data when having a global index will be less efficient.

CONCLUSIONS:

There are severe restrictions on partitioning when introducing surrogate keys because less business attributes are present in the child table, and because unique index for the surrogate key cannot be local if it is not part of the partition key.

Two oracle features can overcome those problems when using surrogate keys:
- Referential partitioning (available in 11g) to partition on a parent table column.
- ‘UPDATE INDEXES’ clause when doing operations on partitioned tables in order to maintain the global indexes (but has some performance overhead)

=> If not needed for other reasons, the rule is to avoid unnecessary surrogate keys for tables that may be partitioned (small volume is not concerned)
=> If surrogate key is however chosen, some Oracle features can be investigated to limit the disadvantages.

5- Indexes overhead:

REQUIREMENT:

The CALLS table is heavily transactional and inserts on it is a transaction where performance requirements are critical.
In addition, we need indexes to search calls by origin (phone line) or destination (partner phone line)

NATURAL KEY MODEL:

We have 2 indexes on the calls table :
- one for the calling line / call time (PHONELINE_PREFIX_NO, PHONELINE_EXTENSION_NO, CALL_START_DATETIME) to enforce primary key, and to search by call origin
- one for the call partner (PARTNER_PREFIX_NO, PARTNER_EXTENSION_NO, CALL_START_DATETIME) to enforce uniqueness and allow searching by call destination
If both indexes have a b-tree of 2 levels, each insert on CALLS accesses 7 blocks (1 for the table, and, for each index, 2 branch blocks to follow b-tree plus 1 leaf block).

SURROGATE KEY MODEL:

We have 3 indexes on the calls table :
- one on CALL_ID for the primary key
- one for the calling line / call time (PHONELINE_ID_ID,CALL_START_DATETIME) to enforce uniqueness and search by call origin,
- one for the call partner (PARTNER_PHONELINE_ID,CALL_START_DATETIME) to enforce uniqueness and allow searching by call destination.
Here we have one more index to maintain for each inserted rows: 3 more blocks must be accessed which is in this case an additional performance cost of 42%

Besides that, that additional index on the surrogate key has constantly increasing values (as generating them from a sequence is the only way to avoid contention to generate unique id) and all inserts goes to the high end of the b-tree index. That is very good for index space as there is no need to keep free space in other blocks, but that can result in contention on that high end block. A possible solution is the use of reverse index, but then index will need free space in all blocks for new inserts. We have to decide to implement normal or reverse key index for that surrogate key.

CONCLUSIONS:

Surrogate key often introduce another index where natural key index can serve 2 goals: key enforcement as well as business meaning.
In addition when table is partitionned, the index cannot be local. So that this overhead is not scalable.

=> If not needed for other reasons, the rule is to avoid unnecessary surrogate keys when performance requirement for inserting new rows is important.
=> If still chosen, this overhead must be evaluated, the usage of reverse index or not must be analyzed carefully for the index on the surrogate key.

6- Special values handling:

REQUIREMENT:

We have calls that are not assigned to any partner phone lines:


PHONE 99-1234 dials an erroneous number : ‘1234-12′


PHONE 11-6789 calls his voicemail

NATURAL KEY MODEL:

Those numbers are not real phone lines, so they cannot have a key that has a business meaning.
We can have null value for the partner phone number:
insert into CALLS ( PHONELINE_PREFIX_NO, PHONELINE_EXTENSION_NO, CALL_START_TIME, PARTNER_PREFIX_NO, PARTNER_EXTENSION_NO, DURATION )
values (’99′,’1234′,to_date(’01-JAN-2007 08:30:00′,’dd-mon-yyyy hh24:mi:ss’), null , null , 3);

But we have only one special value – NULL – that means that number is unknown.
And in addition those calls will not be indexed with a regular index on (PARTNER_PREFIX_NO,PARTNER_EXTENSION_NO)

If we want to record the call to voicemail (and not have it as unknown also) then we need to introduce a special phone line when we create our PHONELINES table:

insert into PHONELINES (PREFIX_NO,EXTENSION_NO,USER_NAME) values (’XX‘,’VMAIL‘,null)

And then the call to the voicemail is recorded as:

insert into CALLS ( PHONELINE_PREFIX_NO, PHONELINE_EXTENSION_NO, CALL_START_TIME, PARTNER_PREFIX_NO, PARTNER_EXTENSION_NO, DURATION )
values (’99′,’1234′,to_date(’01-JAN-2007 08:35:00′,’dd-mon-yyyy hh24:mi:ss’), ‘XX‘, ‘VMAIL‘, 60);

But that is not anymore natural key as it has only a technical meaning.

SURROGATE KEY MODEL:

Surrogate keys, as they have only a technical meaning can have infinite special values that are set when we create the PHONELINES  table. We add a SPECIAL_NAME column on PHONELINES to differentiate them:

insert into PHONELINES (PHONELINE_ID,PREFIX_NO,EXTENSION_NO,USER_NAME,SPECIAL_NAME) values (-1,null,null,null,’error’)
insert into PHONELINES (PHONELINE_ID,PREFIX_NO,EXTENSION_NO,USER_NAME,SPECIAL_NAME) values (-2,null,null,null,’voicemail’)

Then CALLS can reference those values as any other phone line.

CONCLUSIONS:

A natural key must have a business signification for all values. Only NULL (that has the business meaning of ‘unknown’) is allowed.
If there is a need to have several technical values, that have no business meaning, then a surrogate key can be helpful.

But, that may also come from an incomplete functional analysis.
First, with the surrogate key we have introduced a new column to differentiate those values. That may hide the fact that the PHONELINE covers different business concepts.
In addition, what if the dialed number do not fit in the phone number format (2 + 4 digits) ? Or if a future evolution of the system allows external calls, with another digit pattern ?
In our case, we probably have 2 business concepts: the internal phone line, and the dialed number.

However, choosing surrogate keys to group similar business concepts can be a good choice in datawarehouse star schemas, in order to limit the number of dimensions.
Inheritance implementation is also a case where we need one key to cover different entities. An example is one table that group all company actors (suppliers, customers, employees) which have a different natural key.

=> Surrogate keys are useful to record several special (technical) values for one entity, or to implement all classes in an inheritance tree into one table.
=> However, it is a good practice to verify first if that surrogate key does not hide a business concept that was forgotten during functional analysis.

7- Surrogate keys visible to users:

REQUIREMENT:

A web application displays the call history, using an URL that gets directly to the phone line call history

NATURAL KEY MODEL:

URL will be something like ‘http://application.intranet.com/showCalls?phone_prefix_no=99&phone_extension_no=1234′

The user can bookmark that link, send it by mail to a colleague, etc. The URL manipulates only business values.

SURROGATE KEY MODEL:

URL will be something like ‘http://application.intranet.com/showCalls?phone_id=0026874′

The user now manipulates blindly a value that should not be used outside of the database.
In the case where a user moves and changes his phone number, then the bookmarked URL will point to incorrect data.

This is not a rare example. We all have bookmarks in our navigator that points to a ‘404: NOT FOUND’ page because the page location has changed: page location was internal to the web application system, but were made visible outside to end users.

This is an example with URL. It is also frequent to see surrogate keys printed in some reports, on product labels or identification cards.

CONCLUSIONS:

When using surrogate keys, there is a risk that, one day, the key is visible outside the system, and then becomes a natural key.
Business often needs to identify its objects. If business has not been able to define a natural key, but need to have an identifier, we can see situations where they will use the surrogate key. Then, that key has both disadvantages of surrogate key and natural key.

If business do not have a natural key, but needs one, then we can generate one from a sequence, and make it visible to the end user. however, this is not a surrogate key. it is a real natural key that is assigned by the system, but then can be manipulated outside of it. An example is a customer number: the system checks if the customer is an existing one (same name, same address, etc…) and generate a value for new ones. Customer number is a natural key that has a business meaning, it can be exchange between several systems (customer care, orders, billing, accounting, etc), The customer is aware if his customer number.
In this example, Business had no natural key (as name, address, email, etc. are not good natural keys: they are not identifying, and are mutable). So the IT systems creates a natural key for them.

=> The rule is to introduce surrogate key only for technical reason, and not because business need an identifier that they can’t define.
=> It is a good practice to verify first if that surrogate key does not hide the need for a new business attribute, needed for the whole information system (IT and non IT).

8- Identifying natural key:

REQUIREMENT:

From functional analysis, we have difficulties to define a real business key. For example, we are not sure that (phone line + call time) uniquely identifies a call when using special services such as conference calls (one phone line can call several numbers at the same time, and each one are a different call).

NATURAL KEY MODEL:

That is a real problem as we must be sure that primary key is unique. And changing the primary key definition concerns several tables, so it is a very deep change.
If a future evolution need to accept 2 calls that can be issued from the same phone at the same time then we need to revisit a big part of our data model, and related software components.

SURROGATE KEY MODEL:

As the referential integrity do not depend on the business key, we can accept badly defined business keys, or future evolution of the business concept definitions.
If 2 calls can be issued from the same phone at the same time, then only the CALLS table structure will change.

CONCLUSIONS:

Surrogate keys can help when business candidate keys are not properly identified. However, business usually need one day to identify the objects they define, and there is a risk when using surrogate keys that it hides some errors done during analysis.

For example, if business do not have a real mean to identify calls, then one day they will probably use our surrogate key to identify them: first within the system, later with external systems, and finally with the end user.

=> Candidate keys are often difficult to identify, and surrogate keys can be introduced.
=> However, that must not be used as an easy way to design on bad specifications.

9- Merging different systems:

REQUIREMENT:

Each company site have the same phone software, but deployed at each company sites, with their own databases.
Now that the network bandwidth is enhanced, and IT department wants to merge all those systems into one database.

NATURAL KEY MODEL:

As natural keys have a meaning outside the system, there is no problem to merge calls from the different company sites. There will not be duplicates as phoneline prefix is different among sites. This maintenance operations involves the following operation on the target database:

insert into PHONELINES select * from PHONELINES@DBLINK_SITE1 union all select * from PHONELINES@DBLINK_SITE2 …
insert /*+ append */ into CALLS select * from CALLS@DBLINK_SITE1 union all select * from CALLS@DBLINK_SITE2 …
commit;

Then change the sequence current values to the maximum existing ones.

That is straightforward and has optimal performance.

SURROGATE KEY MODEL:

Here each system has generated its surrogate keys from its own sequence. We cannot merge that as-is because we will have duplicates. We need to re-numerate the surrogate keys.
- import PHONELINES assigning a new PHONELINE_ID from the sequence
- join the source CALLS table with the PHONELINES table to translate surrogate key into PHONE_PREFIX_BO and PHONE_EXTENSION_NO
- join the result with the target PHONELINES table to get the new PHONELINE_ID associated with PHONE_PREFIX_BO and PHONE_EXTENSION_NO
- insert that result to target CALLS table

That usually introduces a step where identifying a natural key is needed to be consistent between the systems.

CONCLUSIONS:

Replicating data from different system is a complex operation when using surrogate keys because those keys are relevant to one system only.
That operation is not a frequent online operation, but same issue is encountered in several real-life situations:
- merge several systems into one
- distribute application in several systems
- share data with external systems
- feed a datawarehouse from different operational systems
For those operations, surrogate keys must be resolved to natural keys before data is transported.
=> If not needed for other reasons, adding surrogate keys adds complexity to several operations that transport data between different systems.
=> If surrogate key was introduced because of the difficulty to identify a business identifier, that difficulty will need to be solved for those operations.

10- Natural key evolution:

REQUIREMENT:

Our system evolves to manage several companies:
We need to add a COMPANY_CODE to the phone number to properly identify a phone.

NATURAL KEY MODEL:

We need to add COMPANY_CODE column to the PHONELINES table, and that column will be part of the key. Probably referencing a COMPANIES table.
As a consequence, all child tables must add the company code in their foreign keys: several DDL changes, and it increases the key length.
The good thing is that we can partition those table by the company code to get same performance as when they were multiple databases.

SURROGATE KEY MODEL:

We add COMPANY_CODE, but it is not part of the key, so there is no DDL changes to cascade. As we are using surrogate keys, we will probably add a COMPANIES table with COMPANY_ID surrogate key , and COMPANY_CODE non-key column. So the drawback is that we will have one more table to join if we use a predicate on company code. And no possibility to use partitioning on that.

CONCLUSIONS:

When a new attribute is introduced in the business key, the change is more complex when using natural keys.
But that evolution lead to disadvantages at both sides:
- the number of columns for composite natural keys, and the number of changes in the data model.
- the number of joins when using surrogate keys, the lack of possibility to partition child tables on that attribute.
As those points are some of the parameters that influence the decision on surrogate vs. natural key, the the decision should be re-analyzed.

=> Surrogate keys can be introduced for flexibility in schema evolution – if not prevented for other reasons.
=> However, in both cases, we need to investigate the consequences on other concerns.

11- Composite key:

REQUIREMENT:

Our system has a high depth of referential integrity.
We already added the company code to the phone line, and in addition to the model described above, one call has several services associated to it, each service has a rate that depends on the duration, so we have to detail calls into call services and call services into call service time buckets.

NATURAL KEY MODEL:

Each master-detail relationship adds a new column to the key. the CALL_SERVICE_TIME_BUCKET table will have the following primary key:
PHONELINE_PREFIX_NO,PHONELINE_EXTENSION_NO,CALL_DATETIME,SERVICE_CODE,BUCKET_NUMBER
Even if compound keys are not a problem by themselves, we can go to a limit where sql queries will be too complex to read, where indexes will be very large, and where joining tables on those columns will be less effective that with a single number.

There is no limit, but having 5 or more large columns can be bad.

SURROGATE KEY MODEL:

The key is always one single numeric column. However, composite keys are still needed for association tables that implement the many-to-many relationship.

CONCLUSIONS:

The complexity of composite keys must be analyzed to take a decision about natural or surrogate key.
=> Surrogate key may need to be introduced at some level to avoid composite natural keys become too large.

12- Cost Base Optimizer:

REQUIREMENT:

We have very different distribution of calls from the different sites (recognized by the prefix number)
and we need optimized performance for all queries.

NATURAL KEY MODEL:


We create histograms on PHONELINE_PREFIX_NO columns, so that the Oracle
optimizer can estimate the accurate cardinality for queries that are
done on a specific prefix.

SURROGATE KEY MODEL:


We can create that kind of histograms only on PHONELINES, that will
show the distribution of phone lines among company sites, but not the
distribution of calls.

CONCLUSIONS:

Natural keys have more information about the data that can be used by the optimizer.
Even without histograms, the CBO has more information when the
appropriate datatype is used and composite keys have statistics for all
part of them. So the complexity we have with composite keys can be a
real advantage for the optimizer.

=> If not needed for
other reasons, natural keys may be preferred to give more information
about data to the Oracle optimizer.
=> If surrogate key is introduced to replace a large composite key,
we must check if we don’t hide important information (metadata) to the
optimizer.

13- Storage (Row Size):

REQUIREMENT:

Our system has to store a large call history and we want to avoid unnecessary storage, so we need to have the minimal rows length.

NATURAL KEY MODEL:

PHONELINES has the minimal size as it records only business columns
CALLS has however to store the foreign key (PHONELINE_PREFIX_NO,PHONELINE_EXTENSION_NO) that is 6 bytes.

SURROGATE KEY MODEL:

PHONELINES has an additional column for the foreign key, and an additional index on it.
If we have less than one thousand phone lines it takes only 3 bytes per row, so the overhead is not so important.
CALLS replaces the natural foreign key with the surrogate key that is a number. If we have one million calls, the number takes less than 4 bytes, instead of the 6 bytes for the primary key. Here again the gain is not so important.

CONCLUSIONS:

The difference on the average row size is not very important, and it depend on the size of the natural key, the depth of referential integrity, etc.
=> The size of rows is not a real argument to choose surrogate or natural key, except special cases (natural key very large, and big detail table for example)

14- Number of possible key values :

REQUIREMENT:

We need to be sure that there is no limit on the number of calls to store.

NATURAL KEY MODEL:

There is no limit: all values are possible as they have all a business meaning.

SURROGATE KEY MODEL:

A number can store 38 digits. No limit will be exhausted even when generating billions of numbers per seconds during hundred of years.

CONCLUSIONS:

=> number of values for the key is not an argument.

15- Data modeling tools:

REQUIREMENT:

We want to use some data modeling tool that do not support composite keys very well.

NATURAL KEY MODEL:

We may have issues as natural key is often composed of several columns.

SURROGATE KEY MODEL:

We still have issues as we cannot avoid composite keys for many-to-many association tables.

CONCLUSIONS:

Composite primary keys cannot be avoided even when using surrogate keys as much as possible.
In addition, the data model should be driven by business needs and performance requirements, not by the features of a tool .
=> This cannot be real argument.

16- SQL complexity:

REQUIREMENT:

We want our code to be easily readable, avoiding complex sql queries.

NATURAL KEY MODEL:

We have less joins, but when we have joins, they are more complex (several columns). In addition, checking that natural key is not updated during online transactions adds a little complexity.

SURROGATE KEY MODEL:

We have simpler joins, but more joins. In addition, checking that surrogate keys are never shown to the end user adds some complexity.

CONCLUSIONS:

Both have advantages and disadvantages.
=> The SQL complexity is not a real argument.

Conclusion

Conclusion

NATURAL KEYS
are chosen when:
  1. There are business attributes that identifies the business object instance
  2. Those attributes are known during the whole business object life cycle (from its creation)
  3. Those attributes values do not change during the whole business object life cycle – or it is exceptional.
  4. Those attributes are used to access to specific business objects (and then are used in indexes, partition keys)
  5. For each business object instance, the value of those attributes has a meaning for business.
  6. The data has to be exchanged among different system.
  7. We don’t expect future evolution on the business object definition (and the key definition).
  8. The key is composed with few attributes (5 starts to be high)
SURROGATE KEYS
are chosen when:
  1. No business attributes identifies the business concept, and business do not need an identifier.
  2. The business identifier is not static during the object life cycle
  3. The business identifier values must also include special technical (non-business) values
  4. The functional specification cannot be made stable enough to limit future evolution of the business key
  5. The surrogate key will never be visible outside if the system (in prints, reports, external systems, datawarehouses, etc)
  6. There is no critical performance requirements to query that object

This is a case-by-case analysis, and some compromises may have to be done.
If a compromise has to be done, some RDBMS features can be implemented to limit the drawbacks


If surrogate key seems to be necessary, it is often a good idea to iterate on the functional specifications to be sure that it does not hide a missed business concept.


The goal the document was to list all concerns related with the choice of natural or surrogate key, so that the choice is not done by a systematic ‘rule of thumb’, but with a complete knowledge of the consequences.

References

  • Steve Adams
Considering Synthetic Keys and Reverse Key Indexes
http://www.ixora.com.au/tips/design/synthetic_keys.htm
  • Thomas Kyte
AskTom answer on “Surrogate versus Natural Keys”
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:689240000346704229
  • Jonathan Lewis
Meaningless Keys
http://jonathanlewis.wordpress.com/2006/12/29/meaningless-keys/
  • Niall Litchfield
Re: The Case Against Compound/Natural Keys
http://www.freelists.org/post/oracle-l/The-Case-Against-CompoundNatural-Keys,2
  • Ralph Kimball
Surrogate Keys [in datawharehouses]
http://www.dbmsmag.com/9805d05.html
  • Peter Scott
Surrogate keys in Data Warehousing
http://www.rittmanmead.com/2005/06/15/surrogate-keys/
  • Scott W. Ambler
Choosing a Primary Key [Agile Data Method]
http://www.agiledata.org/essays/keys.html


All tests and documentation that helped building this document were related to 10gR2 Oracle version.
All comments are welcome at
contact@pachot.net

Posted in data modeling | Tagged: | Leave a Comment »