How to perform SCD Type2 on all columns using SQL query

Ram Teja Y
4 min readJan 24, 2021

--

Introduction:

This article is for developers looking to implement SCD type 2 using SQL querying or for students/freshers looking to learn or know about the same. Code is provided and can be re-used with tweaks as per your table structure. SCD, for people who don’t know stands for Slowly Changing Dimension. It basically captures changes in dimensional data (most common examples are customer, product and many more based on type of business) and also maintains versions before change. Updating the changed value alone falls under type 1 and versioning them is type 2, there are other types too. Capturing change and maintaining the history can be very useful in master data management and also for business analytics in cases.

Steps with Example:

Assume our Product table looks like this as of now

First step to implement SCD2 is to create additional fields in your table which will help describe the changes in future. In my case, I will create 3 new fields as below:

Current_Flag: This field will contain Y/N values to let you know if the record is an older version of a changed record or a new record.

End_Date: This field will contain a date value which can be helpful to know when data in a record has been changed. If its an active record I will be showing a future date i.e. 1st Jan 2050.

Changed_Field: Since we are going to capture change in all fields, i have kept a field to show on which fields has change happened. This will basically contain the list of all fields that have been changed concatenated with any delimiter.

Query for adding fields required for SCD2 implementation:

alter table schema.Dim_Product add column Current_Flag varchar(1);
alter table schema.Dim_Product add column End_date date;
alter table schema.Dim_Product add column Changed_Field varchar(100);

After these 3 fields have been added to the table, lets update the default values to these fields using below query

update schema.Dim_Product set
Current_Flag=’Y’
,End_date=’2050–01–01'
,Changed_Field=’NA’;

After doing this, we have to write SQL queries which should consider two conditions. One is to insert the newly entered records and the other is to update the existing records in the table accordingly so that it is marked as a changed record and insert the changed record from the source as such. Let us assume that the new data looks like below.

New Data from source

If you notice, the first two records are changes to existing data and the other two are absolute new entries. We have to handle both these cases in the query we are going to write below. Let us assume we’ve captured new data from source into a table called schema.Stg_Product .

Query for inserting new records (1008,1009 items in our example):

insert into schema.Dim_Product
(
select b.*,’Y’,’2400–01–01',null from schema.Dim_Product a right join schema.Stg_Product b
on nvl(a.Product_ID,’-1')=nvl(b.Product_ID,’-1') and a.current_flag=’Y’
where a.Product_ID is null
);

The first case has been handled, let us know handle the other case.

Create a temporary table which contains only the changed records (1004,1005 in our example).

create temp table #product_temp
select a.Product_ID as Product_ID_New,
case when a.Category_Name <> b.Category_Name then ‘-Category_Name’ else ‘’ end ||
case when a.Sub_Category_Name <> b.Sub_Category_Name then ‘-Sub_Category_Name’ else ‘’ end ||
case when a.Brand <> b.Brand then ‘-Brand’ else ‘’ end ||
case when a.Feature_Desc <> b.Feature_Desc then ‘-Feature_Desc’ else ‘’ end as CHANGED_COLUMN_NEW
from schema.Dim_Product a join schema.Stg_Product b
on a.Product_ID=b.Product_ID and a.current_flag=’Y’
where
a.Category_Name <> b.Category_Name or
a.Sub_Category_Name <> b.Sub_Category_Name or
a.Brand <> b.Brand or
a.Feature_Desc <> b.Feature_Desc;

Update the old records to flag them as non-current, end date and show which field has change.

update schema.Dim_Product set current_flag=’N’,end_date=current_date, changed_column= changed_column_new
from #product_temp
where Product_ID=Product_ID_New and current_flag=’Y’;

Insert the newly changed records as current ones and drop the temporary table.

insert into schema.Dim_Product
(select *,’Y’,’2400–01–01',null from schema.Stg_Product
where Product_ID in (select distinct Product_ID_New from #product_temp));

drop table #product_temp;

You can write a similar set of queries for your case and create a simple SQL procedure and add a condition to trigger it once the Staging data gets loaded and your done! Note that creating new columns and default value update is a one time step.

The final output will look like this. The red shade indicates changed records with older values. The lighter green share indicates the corresponding new records for changed values and the dark green shade indicates absolutely new entries.

Conclusion:

These steps can also be achieved using any ETL tool as well but writing your own query gives you power and control over your process and also a full understanding of the complete process involved. Hope this helped you in some way! Have fun with data!

--

--

Ram Teja Y
Ram Teja Y

No responses yet