Monday, November 11, 2013

Surrogate key

Surrogate key

  • What is surrogate key?
  • Where do we use surrogate key?
  • How it is different from natural key?
  • How do we implement surrogate key?
  • What is the advantage of using surrogate key?


  • What is surrogate key?

As its name suggests, Surrogate means ‘substitute’.
Surrogate key is just another substitute key, which will be used to uniquely identify the rows in a dimension table. Surrogate key will have no business meaning.
It’s internally generated by system and is invisible to the user or application. Thus its value cannot be modified by user or application.

  • Where do we use surrogate key?

If the table is not having any attribute (or natural key), which uniquely identify a row.
If the attribute used as unique identifier uses much storage than required. For example if the attribute is alphanumeric.
When there is doubt of changes in the values of the attribute (used for unique identification of the row)
If the table participate in more than one relationship from the logical data model. Then data needs to be fetched from multiple related table using joins.
If we want to improve performance of data retrieval from multiple table.


  • Difference between Surrogate key and natural key

Surrogate key
Natural key
Don’t have business meaning
Have business meaning, for e.g. SSN number

These can be changed.
Small in size, generally of integer type. Can be created as identity column
Bigger in size, generally alphanumeric or string,
Comparison takes less time , as it compares to integer, joins are fast
Comparison  takes much time, as it converts string or alphanumeric to Ascii , and then comparison takes place, joins are slow
No locking contentions because of unique constraint, as the surrogates get generated by the DB and are cached – very scalable.

Locking contention may occur, as two sessions may try to insert the same unique business key,
Not searchable from application
searchable
An additional index is needed.  
No additional index required
There are often generated by system .
Records cannot be inserted untill the key is not known, as it has got business meaning



Let me give you a simple, classical example:
On the 1st of January 2002, Employee 'E1' belongs to Business Unit 'BU1' (that's what would be in your Employee Dimension). This employee has a turnover allocated to him on the Business Unit 'BU1' But on the 2nd of June the Employee 'E1' is muted from Business Unit 'BU1' to Business Unit 'BU2.' The entire new turnovers have to belong to the new Business Unit 'BU2' but the old one should belong to the Business Unit 'BU1.'
If you used the natural business key 'E1' for your employee within your datawarehouse everything would be allocated to Business Unit 'BU2' even what actually belongs to 'BU1.'
If you use surrogate keys, you could create on the 2nd of June a new record for the Employee 'E1' in your Employee Dimension with a new surrogate key.
This way, in your fact table, you have your old data (before 2nd of June) with the SID of the Employee 'E1' + 'BU1.' All new data (after 2nd of June) would take the SID of the employee 'E1' + 'BU2.'
You could consider Slowly Changing Dimension as an enlargement of your natural key: natural key of the Employee was Employee Code 'E1' but for you it becomes
Employee Code + Business Unit - 'E1' + 'BU1' or 'E1' + 'BU2.' But the difference with the natural key enlargement process, is that you might not have all part of your new key within your fact table, so you might not be able to do the join on the new enlarge key -> so you need another id.